GUI – Introducing the Latest arXiv Papers

UFO2: The Desktop AgentOS, UI-TARS-1.5

UFO2: The Desktop AgentOS [60.3]
UFO2 is a multi-agent AgentOS for Windows desktops that has evolved into practical, system-level automation. We evaluate UFO2 on more than 20 real-world Windows applications and show substantial improvements in robustness and execution accuracy over prior CUAs. Our results indicate that deep OS integration opens a scalable path toward reliable, user-oriented desktop automation.
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 13:04:43 GMT)
Version 2 of UFO, covered earlier in "OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused)". As with Agent S, the steady stream of new versions shows how much momentum this field has. ByteDance has also released UI-TARS-1.5, the successor to "UI-TARS: Pioneering Automated GUI Interaction with Native Agents" (UI-TARS: Next-generation native GUI agent model designed to interact seamlessly with GUIs using human-like perception; see below).
Repository: GitHub – microsoft/UFO: The Desktop AgentOS.

Introducing UI-TARS-1.5
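UI-TARS-1.5 is an open-source multimodal agent built on a powerful vision-language model. It integrates advanced reasoning enabled by reinforcement learning and achieves state-of-the-art results on a range of standard benchmarks.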

Towards Trustworthy GUI Agents: A Survey

Towards Trustworthy GUI Agents: A Survey [64.6]
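This survey examines the trustworthiness of GUI agents along five critical dimensions. It identifies major challenges, including vulnerability to adversarial attacks and cascading failure modes in sequential decision-making. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
Reference translation of the paper (metadata) (Sun, 30 Mar 2025 13:26:00 GMT)
A survey on the trustworthiness of GUI agents. It is organized along five axes: "Security", "Reliability", "Explainability", "Ethical Alignment", and "Evaluation methodologies".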

Agent S2, Devin 2, Amazon Nova Act, An Illusion of Progress? Assessing the Current State of Web Agents

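Version 2 of Agent S, which was covered here previously, has been released. In half a year its OSWorld score rose from 20.5 to 27.0 (15 steps); improvements in the base model (LLM) surely contribute, but the steady progress is tangible. With announcements such as Introducing Amazon Nova Act | Amazon AGI Labs and Cognition | Devin 2.0 arriving one after another, GUI-agent-style LLM-based agents are clearly in vogue.

Among personal sites, fugumt.com has also added agent functionality with Fugu-MT:Agent (Embedding an agent into a site using OpenManus | Proof of Concept). Because this makes it easy to extend a site's functionality, I expect more sites like this to appear (*1).

Against this backdrop, "An Illusion of Progress? Assessing the Current State of Web Agents" points out that "Surprisingly, many recent agents, except for Operator, do not outperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024." As the paper itself notes, sound evaluation datasets and frameworks are needed.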

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents [30.3]
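Computer use agents automate digital tasks by interacting directly with graphical user interfaces (GUIs) on computers and mobile devices. This paper introduces Agent S2, a new compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
Reference translation of the paper (metadata) (Tue, 01 Apr 2025 15:40:27 GMT)
Version 2 of "Agent S: An Open Agentic Framework that Uses Computers Like a Human". Performance improves across the board, and the authors claim SOTA on various benchmarks.
Repository: GitHub – simular-ai/Agent-S: Agent S: an open agentic framework that uses computers like a human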

An Illusion of Progress? Assessing the Current State of Web Agents [49.8]
We conduct a comprehensive and rigorous assessment of the current state of web agents. The results paint a very different picture of current agents' capabilities, suggesting that previously reported results were overly optimistic. We introduce Online-Mind2Web, an online evaluation benchmark.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 05:51:29 GMT)
A benchmark for web agents. According to the paper, "Many recent agents, except for Operator (OpenAI, 2025), underperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024. Even Operator only achieves a success rate of 61%, showing substantial room for improvement."
Repository: GitHub – OSU-NLP-Group/Online-Mind2Web

(*1) The behavior is fun to watch, so I forced it to work using OpenManus. Its practical usefulness is questionable for now, but a version upgrade is planned soon.

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC

PC-Agent: A Hierarchical Multi-Agent Collaboration Framework for Complex Task Automation on PC [98.8]
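This paper proposes PC-Agent, a hierarchical agent framework. On the perception side, we devise an Active Perception Module (APM) to overcome the inadequate ability of current MLLMs to perceive screenshot content. On the decision-making side, we propose a hierarchical multi-agent collaboration architecture to handle complex user instructions and interdependent subtasks more effectively.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 05:41:55 GMT)
A framework characterized by (1) an Active Perception Module, (2) Hierarchical Multi-agent Collaboration, and (3) Reflection-based Dynamic Decision-making. The authors also build a benchmark for evaluation and claim an advantage over UFO and Agent-S.
It uses a multi-agent configuration consisting of a Manager Agent, Progress Agent, Decision Agent, and Reflection Agent.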
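To make the division of labor among the four agents concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not PC-Agent's implementation: only the four role names come from the paper, while every function name, the " then " splitting rule, the retry limit, and the `execute` callback are hypothetical stand-ins for what would be MLLM calls and real GUI actions.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ProgressAgent:
    """Keeps a log of the actions that have been accepted so far."""
    done: List[str] = field(default_factory=list)

    def update(self, action: str) -> None:
        self.done.append(action)


def manager_agent(instruction: str) -> List[str]:
    """Split an instruction into ordered subtasks (toy rule: split on ' then ')."""
    return [part.strip() for part in instruction.split(" then ") if part.strip()]


def decision_agent(subtask: str, attempt: int) -> str:
    """Choose the next action for a subtask; a real agent would query an MLLM."""
    return f"{subtask} (attempt {attempt})"


def reflection_agent(succeeded: bool) -> str:
    """Judge the outcome of an action and tell the caller to accept or retry."""
    return "accept" if succeeded else "retry"


def run(instruction: str, execute: Callable[[str], bool],
        max_attempts: int = 3) -> ProgressAgent:
    """Drive the loop: decompose, decide, execute, reflect, record progress."""
    progress = ProgressAgent()
    for subtask in manager_agent(instruction):
        for attempt in range(1, max_attempts + 1):
            action = decision_agent(subtask, attempt)
            verdict = reflection_agent(execute(action))
            if verdict == "accept":
                progress.update(action)
                break
        else:
            # Hypothetical policy: abort once the retry budget is exhausted.
            raise RuntimeError(f"gave up on subtask: {subtask}")
    return progress
```

Calling `run("open Notepad then type hello", execute)` with any callback that reports success or failure per action returns the log of accepted actions. The point of the sketch is only the control flow: the reflection step decides whether a subtask advances or is retried.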

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.2]
This paper introduces UI-TARS, a native GUI agent model. On the OSWorld benchmark, UI-TARS scores 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 17:48:10 GMT)
Proposes the GUI agent UI-TARS and claims SOTA on various tasks. "UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines." It really feels like they packed in everything they could.
Repository: GitHub – bytedance/UI-TARS
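The "unified action space across platforms" idea in item (2) of the quote can be pictured with a small Python sketch. This is a guess at the general shape, not UI-TARS's actual action schema: the alias table, the `UnifiedAction` type, the `unify` function, and the choice to normalize coordinates to the range [0, 1] are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

# Hypothetical mapping from platform-specific action names to one shared vocabulary.
_ALIASES = {
    ("desktop", "left_click"): "click",
    ("mobile", "tap"): "click",
    ("web", "click"): "click",
    ("desktop", "type"): "type",
    ("mobile", "input_text"): "type",
    ("web", "fill"): "type",
}


@dataclass(frozen=True)
class UnifiedAction:
    kind: str                                    # shared action name
    point: Optional[Tuple[float, float]] = None  # position normalized to [0, 1]
    text: Optional[str] = None                   # payload for "type" actions


def unify(platform: str, name: str, *,
          xy: Optional[Tuple[int, int]] = None,
          screen: Optional[Tuple[int, int]] = None,
          text: Optional[str] = None) -> UnifiedAction:
    """Translate a platform-specific action into the shared representation."""
    kind = _ALIASES.get((platform, name))
    if kind is None:
        raise KeyError(f"unknown action {name!r} on platform {platform!r}")
    point = None
    if xy is not None:
        if screen is None:
            raise ValueError("screen size is required to normalize coordinates")
        # Divide pixel coordinates by screen size so resolutions become comparable.
        point = (xy[0] / screen[0], xy[1] / screen[1])
    return UnifiedAction(kind, point, text)
```

Under these assumptions, a tap at the center of a phone screen and a left click at the center of a desktop monitor map to the same `UnifiedAction`, which is what would let traces from different platforms be pooled for training.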

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.4]
We propose OS-Genesis, a new data synthesis pipeline for graphical user interface (GUI) agents. Instead of relying on predefined tasks, OS-Genesis lets agents first perceive the environment and perform step-wise interactions. A trajectory reward model is then used to ensure the quality of the generated trajectories.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 16:21:58 GMT)
Proposes a synthetic data construction method for GUI agent development, an area where research is moving fast. The base data is built as follows: "OS-Genesis begins by exploring the functionality of GUI environments through traversing interactive UI elements with actions (e.g., CLICK). This forms the basis for reverse task synthesis, where observed states and actions are retroactively transformed into low-level instructions. These low-level instructions are then derived into high-level instructions, which can seed the collection of GUI trajectories." Quality is then ensured with a Trajectory Reward Model: "Built upon GPT-4o, TRM aims to perform a graded evaluation with a reward score R ∈ [1, 5] to assist in sampling for training." Or so the paper says...
Repository: OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
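One plausible reading of "a reward score R ∈ [1, 5] to assist in sampling for training" is that higher-scored trajectories are drawn more often rather than low-scored ones being discarded outright. The Python sketch below assumes exactly that: sampling probability proportional to the score. The quote does not spell out the sampling rule, so the proportional weighting, the function name, and the trajectory ids are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple

# (hypothetical trajectory id, reward score assigned by the reward model)
Trajectory = Tuple[str, int]


def sample_by_reward(trajectories: Sequence[Trajectory], k: int,
                     seed: int = 0) -> List[str]:
    """Draw k trajectory ids with replacement, weighted by reward score."""
    for traj_id, reward in trajectories:
        if not 1 <= reward <= 5:
            raise ValueError(f"reward for {traj_id!r} must be in [1, 5]")
    rng = random.Random(seed)  # fixed seed keeps the draw reproducible
    ids = [traj_id for traj_id, _ in trajectories]
    weights = [reward for _, reward in trajectories]
    return rng.choices(ids, weights=weights, k=k)


if __name__ == "__main__":
    pool = [("clean_run", 5), ("partial_run", 3), ("noisy_run", 1)]
    batch = sample_by_reward(pool, k=9000)
    for traj_id, _ in pool:
        print(traj_id, batch.count(traj_id))
```

With this weighting a noisy trajectory still appears occasionally, so no data is thrown away, while clean trajectories dominate each batch. A hard threshold on the score would be an equally simple alternative.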

GUI Agents: A Survey

GUI Agents: A Survey [129.9]
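Graphical user interface (GUI) agents have emerged as a transformative approach to automating human-computer interaction. Given the growing interest in GUI agents and their fundamental importance, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
Reference translation of the paper (metadata) (Wed, 18 Dec 2024 04:48:28 GMT)
A survey of agents that operate GUIs.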

Large Language Model-Brained GUI Agents: A Survey

Large Language Model-Brained GUI Agents: A Survey [43.2]
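Multimodal models have ushered in a new era of GUI automation. They have demonstrated exceptional capabilities in natural language understanding, code generation, and visual processing. These agents represent a paradigm shift, enabling users to perform complex multi-step tasks through simple conversational commands.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 12:13:39 GMT)
A survey similar to "GUI Agents with Foundation Models: A Comprehensive Survey", but this one has a Microsoft researcher as first author.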

GUI Agents with Foundation Models: A Comprehensive Survey

GUI Agents with Foundation Models: A Comprehensive Survey [53.0]
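This survey consolidates recent research on (M)LLM-based GUI agents, highlighting key innovations in data, frameworks, and applications. We hope this paper will inspire further developments in the field of (M)LLM-based GUI agents.
Reference translation of the paper (metadata) (Thu, 07 Nov 2024 17:28:10 GMT)
A survey of MLLM-based GUI agents.
Surveys appearing almost as soon as the research seems to be picking up says a lot about the current pace of this field.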

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.4]
OS-Atlas is a foundational GUI action model that excels at GUI grounding and OOD agentic tasks. We release an open-source, cross-platform GUI grounding corpus that so far contains over 13 million GUI elements.
Reference translation of the paper (metadata) (Wed, 30 Oct 2024 17:10:19 GMT)
Proposes a Foundation Action Model targeting GUIs; with Anthropic's announcement as well, this is an area gaining momentum. On performance, the paper states: "although GPT-4o with OS-Atlas-Base as the grounding module still lags behind human performance, it significantly outperforms other grounding methods such as SeeClick and Set-of-Mark (SoM)".
Repository: OS-Atlas Homepage
