GUI agent – Introduction to the Latest arXiv Papers

GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs

GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs [16.9]
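We propose GUIPilot, an approach for detecting inconsistencies between mobile designs and their implementations. In experiments, GUIPilot achieved 94.5% precision and 99.6% recall in detecting screen inconsistencies. An industrial case study applying GUIPilot to a trading mobile application shows that GUIPilot detected nine application bugs.
Reference translation of the paper (metadata) (Mon, 09 Jun 2025 03:09:48 GMT)
A proposal of an agent for GUI testing.
The repository is GitHub – code-philia/GUIPilot: GUIPilot: A Consistency-based Mobile GUI Testing Approach for Detecting Application-specific Bugs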
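
As a reminder of what the two reported metrics measure, here is a minimal sketch of computing precision and recall from raw detection counts. The function name and the counts are hypothetical and are not taken from the GUIPilot paper or repository.

```python
def precision_recall(true_positives: int, false_positives: int,
                     false_negatives: int) -> tuple[float, float]:
    """Return (precision, recall) from raw detection counts."""
    reported = true_positives + false_positives   # everything the detector flagged
    actual = true_positives + false_negatives     # everything that was really inconsistent
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not the paper's data).
p, r = precision_recall(true_positives=90, false_positives=10, false_negatives=30)
print(f"precision={p:.2f} recall={r:.2f}")  # prints "precision=0.90 recall=0.75"
```

High recall with slightly lower precision, as reported above, means that almost no real inconsistency is missed while a small share of the flagged screens are false alarms.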

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills [57.7]
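In this paper, we propose a Hierarchical Multimodal Skills (HMS) module to address the problem of insufficient knowledge. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, we propose the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.
Reference translation of the paper (metadata) (Thu, 12 Jun 2025 06:21:19 GMT)
A proposal of Mirage-1, a cross-platform, plug-and-play GUI agent whose key components are a "Hierarchical Multimodal Skills (HMS) module for long-horizon planning" and "A Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm for knowledge exploration in online settings."
The project site is Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills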

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent

GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent [66.3]
MLLMs suffer from two major problems: misinterpretation of UI components and outdated knowledge. This paper proposes GUI-Explorer, a training-free GUI agent that incorporates two fundamental mechanisms. With task success rates of 53.7% on SPA-Bench and 47.4% on AndroidWorld, GUI-Explorer shows significant improvements over SOTA agents.
Reference translation of the paper (metadata) (Thu, 22 May 2025 16:01:06 GMT)
The proposed mechanism is described as: "(a) Automatically constructing function-aware exploration goals by analyzing structural information from the GUI environment, followed by systematic exploration to collect diverse function-aware trajectories. (b) Extracting effective screen-operation logic through unsupervised analysis of structured interaction triples (observation, action, outcome), enabling unsupervised knowledge extraction. (c) Performing visual-semantic retrieval between screen visuals and the knowledge vector store to construct Dynamic Guidance achieves dual objectives: preventing UI misinterpretation and ensuring action proposals align with actual UI states." It improves scores on SPA-Bench and AndroidWorld.
The repository is GitHub – JiuTian-VL/GUI-explorer: [ACL 2025] GUI-explorer: Autonomous Exploration and Mining of Transition-aware Knowledge for GUI Agent
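
To make steps (b) and (c) more concrete, here is a minimal sketch of storing (observation, action, outcome) triples and retrieving the closest one for the current screen. It is not the authors' implementation: the class and function names are hypothetical, and the two-dimensional vectors are toy stand-ins for the embeddings a real visual-semantic encoder would produce.

```python
import math
from dataclasses import dataclass


@dataclass
class KnowledgeEntry:
    embedding: list[float]  # toy vector standing in for a screen embedding
    observation: str
    action: str
    outcome: str


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(store: list[KnowledgeEntry], query: list[float],
             top_k: int = 1) -> list[KnowledgeEntry]:
    """Return the top_k entries most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(e.embedding, query), reverse=True)
    return ranked[:top_k]


def build_guidance(entries: list[KnowledgeEntry]) -> str:
    """Turn retrieved triples into text that could be added to the agent prompt."""
    return "\n".join(
        f"On '{e.observation}', '{e.action}' leads to: {e.outcome}" for e in entries
    )


# Hypothetical knowledge store and query embedding.
store = [
    KnowledgeEntry([1.0, 0.0], "settings screen", "tap gear icon", "opens preferences"),
    KnowledgeEntry([0.0, 1.0], "inbox screen", "tap pencil icon", "opens a new draft"),
]
guidance = build_guidance(retrieve(store, query=[0.1, 0.9]))
print(guidance)  # prints "On 'inbox screen', 'tap pencil icon' leads to: opens a new draft"
```

In a real system the linear scan would be replaced by a vector index, but the flow of embedding the current screen, retrieving stored triples, and turning them into guidance is the same.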

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.1]
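We propose the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. We create the GUI-Net dataset, containing 143K trajectory data spanning five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 06:15:56 GMT)
Dataset construction leveraging web tutorials, and agent development through fine-tuning.
The project site is TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials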

UFO2: The Desktop AgentOS, UI-TARS-1.5

UFO2: The Desktop AgentOS [60.3]
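UFO2 is a multi-agent AgentOS for Windows desktops that moves toward practical, system-level automation. We evaluate UFO2 on more than 20 real-world Windows applications and show substantial improvements in robustness and execution accuracy over previous CUAs. Our results show that deep OS integration opens a scalable path to reliable, user-oriented desktop automation.
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 13:04:43 GMT)
Version 2 of UFO, covered earlier in "OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused) – Introduction to the Latest arXiv Papers". As with Agent S, the steady stream of version upgrades shows how lively this field is. ByteDance has also released UI-TARS-1.5, the successor to the model covered in "UI-TARS: Pioneering Automated GUI Interaction with Native Agents – Introduction to the Latest arXiv Papers" (UI-TARS: Next-generation native GUI agent model designed to interact seamlessly with GUIs using human-like perception, see below).
The repository is GitHub – microsoft/UFO: The Desktop AgentOS.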

Introducing UI-TARS-1.5
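UI-TARS-1.5 is an open-source multimodal agent built on a powerful vision-language model. It integrates advanced reasoning enabled by reinforcement learning and achieves state-of-the-art results on a variety of standard benchmarks.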

Towards Trustworthy GUI Agents: A Survey

Towards Trustworthy GUI Agents: A Survey [64.6]
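This survey examines the trustworthiness of GUI agents along five key dimensions. It identifies major challenges such as vulnerability to adversarial attacks and cascading failure modes in sequential decision-making. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
Reference translation of the paper (metadata) (Sun, 30 Mar 2025 13:26:00 GMT)
A survey on the trustworthiness of GUI agents. The organizing axes are "Security", "Reliability", "Explainability", "Ethical Alignment", and "Evaluation methodologies".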

Inducing Programmatic Skills for Agentic Tasks

Inducing Programmatic Skills for Agentic Tasks [54.0]
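This work proposes Agent Skill Induction (ASI), which lets agents adapt by inducing, verifying, and using program-based skills on the fly. ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, respectively.
Reference translation of the paper (metadata) (Wed, 09 Apr 2025 12:25:37 GMT)
A proposal of a method that uses program code to represent skills, together with a validation of its effectiveness: "We present ASI, namely agent skill induction (§2), that induces and applies programmatic skills along the process of solving user web navigation queries. More concretely, given a natural language (NL) query, the agent first generates an action trajectory attempting to solve the task using built-in, primitive actions such as click and scroll."
Natural language and formal languages differ considerably in expressiveness, in how they abstract, and in how they handle ambiguity, so I suspect that choosing between them appropriately may be what matters.
The repository is GitHub – zorazrw/agent-skill-induction: Agent Skill Induction: “Inducing Programmatic Skills for Agentic Tasks”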
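
The idea of representing a skill as program code built from primitive actions can be sketched as follows. This is only an illustration under stated assumptions: `RecordingPage`, `skill_library`, and `open_last_item` are hypothetical names, the page object merely logs calls instead of driving a browser, and none of it is taken from the ASI repository.

```python
from typing import Callable


class RecordingPage:
    """Hypothetical stand-in for a browser page; it only logs primitive actions."""

    def __init__(self) -> None:
        self.log: list[str] = []

    def click(self, element: str) -> None:
        self.log.append(f"click({element})")

    def scroll(self, direction: str) -> None:
        self.log.append(f"scroll({direction})")


# A skill is an ordinary function, so it can be stored, reused, and tested.
Skill = Callable[..., None]
skill_library: dict[str, Skill] = {}


def open_last_item(page: RecordingPage, list_name: str) -> None:
    """A programmatic skill composed from the primitive click and scroll actions."""
    page.click(list_name)
    page.scroll("down")
    page.click(f"{list_name}:last")


skill_library["open_last_item"] = open_last_item

# Later tasks call the stored skill instead of re-deriving the primitive steps.
page = RecordingPage()
skill_library["open_last_item"](page, "orders")
print(page.log)  # prints "['click(orders)', 'scroll(down)', 'click(orders:last)']"
```

Because the skill is executable, it can be checked by running it, which is harder to do with a skill written as free-form text.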

Agent S2, Devin 2, Amazon Nova Act, An Illusion of Progress? Assessing the Current State of Web Agents

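Version 2 of Agent S, which I covered previously, is out. In half a year its OSWorld score rose from 20.5 to 27.0 (15 steps); better base models (LLMs) are likely part of the reason, but it still feels like steady progress. With a string of announcements such as Introducing Amazon Nova Act | Amazon AGI Labs and Cognition | Devin 2.0, GUI-agent-style LLM-based agents are clearly in vogue.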

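Among personal sites, fugumt.com has also added agent capabilities with Fugu-MT:Agent (Embedding an Agent into a Site Using OpenManus | ぷるーふおぶこんせぷと). Since this makes it easy to extend what a site can do, I expect more sites like this to appear (*1).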

Against this backdrop, "An Illusion of Progress? Assessing the Current State of Web Agents" points out that "Surprisingly, many recent agents, except for Operator, do not outperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024." As the paper itself notes, sound evaluation datasets and frameworks are needed.

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents [30.3]
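Computer use agents automate digital tasks by interacting directly with the graphical user interfaces (GUIs) of computers and mobile devices. This paper introduces Agent S2, a new compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
Reference translation of the paper (metadata) (Tue, 01 Apr 2025 15:40:27 GMT)
Version 2 of the agent covered in "Agent S: An Open Agentic Framework that Uses Computers Like a Human – Introduction to the Latest arXiv Papers". Performance is up across the board, and the authors claim SoTA on various benchmarks.
The repository is GitHub – simular-ai/Agent-S: Agent S: an open agentic framework that uses computers like a human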

An Illusion of Progress? Assessing the Current State of Web Agents [49.8]
We conduct a comprehensive and rigorous assessment of the current state of web agents. The results paint a very different picture of the competency of current agents, suggesting over-optimism in previously reported results. We introduce Online-Mind2Web, an online evaluation benchmark.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 05:51:29 GMT)
A benchmark for web agents. The authors report: "Many recent agents, except for Operator (OpenAI, 2025), underperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024. Even Operator only achieves a success rate of 61%, showing substantial room for improvement."
The repository is GitHub – OSU-NLP-Group/Online-Mind2Web

(*1) Its behavior is fun to watch, so I forced it to work using OpenManus. Its practical usefulness is doubtful for now, but I plan to upgrade it soon.

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.2]
This paper introduces UI-TARS, a native model for GUI agents. On the OSWorld benchmark, UI-TARS scores 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 17:48:10 GMT)
A proposal of the GUI agent UI-TARS, which claims SOTA on a variety of tasks. "UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines." It really feels like they packed in everything they could.
The repository is GitHub – bytedance/UI-TARS

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.4]
We propose OS-Genesis, a new data synthesis pipeline for graphical user interface (GUI) agents. Instead of relying on predefined tasks, OS-Genesis lets agents first perceive the environment and perform step-wise interactions. A trajectory reward model is then used to ensure the quality of the generated trajectories.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 16:21:58 GMT)
A proposal of a synthetic data construction method for GUI agent development, an area where research is advancing rapidly. The base data is built as follows: "OS-Genesis begins by exploring the functionality of GUI environments through traversing interactive UI elements with actions (e.g., CLICK). This forms the basis for reverse task synthesis, where observed states and actions are retroactively transformed into low-level instructions. These low-level instructions are then derived into high-level instructions, which can seed the collection of GUI trajectories." Quality is then assured with a Trajectory Reward Model: "Built upon GPT-4o, TRM aims to perform a graded evaluation with a reward score R ∈ [1, 5] to assist in sampling for training." So they say...
The repository is OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
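
One way a graded reward score R ∈ [1, 5] could "assist in sampling for training" is to draw trajectories with probability proportional to their score rather than discarding low-scoring ones outright. The sketch below assumes that proportional weighting; the function name and trajectory IDs are hypothetical, and the paper's exact sampling rule may differ.

```python
import random


def sample_trajectories(scored: list[tuple[str, int]], k: int,
                        seed: int = 0) -> list[str]:
    """Draw k trajectory IDs (with replacement), weighted by reward score."""
    for name, reward in scored:
        if not 1 <= reward <= 5:
            raise ValueError(f"reward for {name!r} must be in [1, 5], got {reward}")
    rng = random.Random(seed)  # fixed seed so the draw is reproducible
    names = [name for name, _ in scored]
    weights = [reward for _, reward in scored]
    return rng.choices(names, weights=weights, k=k)


# Hypothetical trajectories with reward scores assigned by a reward model.
scored = [("traj_a", 5), ("traj_b", 1), ("traj_c", 3)]
picked = sample_trajectories(scored, k=1000)
for name, _ in scored:
    print(name, picked.count(name))
```

With these weights, `traj_a` is expected to be drawn about five times as often as `traj_b`, so higher-quality trajectories dominate the training set while weaker ones still contribute some diversity.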
