GUI agent – Page 3 – Introducing the Latest arXiv Papers

PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents

PAL-UI: Planning with Active Look-back for Vision-Based GUI Agents [151.9]
We propose PAL-UI (Planning with Active Look-back). PAL-UI combines dual-level summarization agents, which capture both observation-level cues and action-level outcomes, with a dedicated retrieval tool.
Reference translation of the paper (metadata) (Wed, 01 Oct 2025 01:48:39 GMT)
Proposes an agent that incorporates PAL (Planning with Active Look-back), a mechanism for looking back over earlier steps. The authors state: "PAL-UI significantly outperforms both base MLLMs and state-of-the-art baselines on mobile navigation benchmarks, while also generalizing well to out-of-domain web environments. These results underscore the importance of active memory retrieval for robust GUI planning. Future work will explore extending PAL-UI to more complex tasks and environments, integrating reinforcement learning objectives, and broadening its applicability to real-world interactive systems."
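
To make the look-back idea concrete, here is a minimal sketch of a step memory that keeps a short observation summary and action summary per step, plus a retrieval call for pulling back an earlier screen. Everything in it (`StepRecord`, `LookBackMemory`, `look_back`, the screenshot file names) is a hypothetical illustration and is not taken from the PAL-UI paper or code.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class StepRecord:
    """One past step: two short summaries plus a pointer to the raw screen."""
    step: int
    observation_summary: str  # what was on screen (observation-level cue)
    action_summary: str       # what the agent did and what happened (action-level outcome)
    screenshot_ref: str       # hypothetical handle to the stored screenshot


@dataclass
class LookBackMemory:
    records: List[StepRecord] = field(default_factory=list)

    def add(self, observation_summary: str, action_summary: str, screenshot_ref: str) -> int:
        """Store a step and return its index."""
        step = len(self.records)
        self.records.append(
            StepRecord(step, observation_summary, action_summary, screenshot_ref)
        )
        return step

    def compact_history(self) -> str:
        """Text-only history that would be placed in the planner's context."""
        return "\n".join(
            f"step {r.step}: saw {r.observation_summary}; did {r.action_summary}"
            for r in self.records
        )

    def look_back(self, step: int) -> Optional[str]:
        """Retrieval tool: return the stored screenshot handle for a past step, if any."""
        if 0 <= step < len(self.records):
            return self.records[step].screenshot_ref
        return None


memory = LookBackMemory()
memory.add("login screen", "typed username", "shot_0.png")
memory.add("home screen", "tapped Settings", "shot_1.png")
print(memory.compact_history())
print(memory.look_back(0))
```

The design point of the sketch is that only the compact text history is carried at every step, while full screenshots are fetched on demand when the planner decides an earlier screen matters.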

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents

Ferret-UI Lite: Lessons from Building Small On-Device GUI Agents [79.8]
Ferret-UI Lite is a compact, end-to-end GUI agent that runs across a variety of platforms. Ferret-UI Lite achieves performance competitive with other small-scale GUI agents.
Reference translation of the paper (metadata) (Tue, 30 Sep 2025 17:13:56 GMT)
A GUI agent report from Apple. The model is a small one: "In this work, we present Ferret-UI Lite, a 3B multimodal LLM designed for GUI agentic tasks with a focus on lightweight, on-device settings. Through real and synthetic data curation, inference-time visual tool use, and a two-stage SFT–RL training strategy, Ferret-UI Lite achieves competitive grounding and navigation performance relative to larger models."

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents [15.0]
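This paper introduces MAS-Bench, a pioneering benchmark for evaluating GUI-shortcut hybrid agents. It comprises 139 complex tasks across 11 real-world applications, a knowledge base of 88 shortcuts, RPA scripts, and seven evaluation metrics. Experiments show that hybrid agents achieve markedly higher success rates and efficiency than GUI-only agents.
Reference translation of the paper (metadata) (Mon, 08 Sep 2025 09:43:48 GMT)
Proposes a benchmark that also covers shortcutting GUI operations (for example, making an API call instead of operating the screen).
The project site is MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents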

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data

ScaleCUA: Scaling Open-Source Computer Use Agents with Cross-Platform Data [119.8]
ScaleCUA is a step toward scaling open-source computer-use data and foundation models. It provides a large-scale dataset spanning 6 operating systems and 3 task domains.
Reference translation of the paper (metadata) (Thu, 18 Sep 2025 17:59:22 GMT)
"In this work, we introduce ScaleCUA, a step toward scaling open-source CUAs. It offers a large-scale dataset spanning 6 operating systems and 3 task domains, via a closed-loop pipeline uniting automated agents with human experts. Trained on this scaled-up data, ScaleCUA can operate seamlessly across platforms. Specifically, it delivers strong gains over baselines (+26.6 on WebArena-Lite-v2, +10.7 on ScreenSpot-Pro) and sets new state-of-the-art results (94.4% on MMBench-GUI L1-Hard, 60.6% on OSWorld-G, 47.4% on WebArena-Lite-v2). These findings underscore the power of data-driven scaling for general-purpose cross-platform CUAs." A very orthodox, head-on route to better performance.
The repository is GitHub – OpenGVLab/ScaleCUA: ScaleCUA is the open-sourced computer use agents that can operate on cross-platform environments (Windows, macOS, Ubuntu, Android).

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent

AppCopilot: Toward General, Accurate, Long-Horizon, and Efficient Mobile Agent [49.6]
This paper identifies four core problems that must be solved for mobile agents to deliver practical, scalable impact. It introduces AppCopilot, a multimodal, multi-agent, general-purpose on-device assistant. AppCopilot operates across applications and forms a complete closed-loop system from data to deployment.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 15:48:21 GMT)
A paper with enough information to serve as a textbook for this field. The conclusion, "In summary, mobile agents are entering a new era of ecosystem development in intelligent automation, cross-platform operation, and continual learning. Importantly, these abilities should not be viewed as a mere summary of existing achievements, but rather as a vision for future evolution.", is exactly right, and I think it is the reason various research institutions are investing substantial resources.
The repository is GitHub – OpenBMB/AppCopilot: A General, Accurate, Long-Horizon, and Efficient Mobile Agent driven by Multimodal Foundation Models

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [151.0]
Developing autonomous agents for graphical user interfaces is a major challenge in artificial intelligence. This paper presents UI-TARS-2, a GUI-centered agent model. In empirical evaluation, UI-TARS-2 improves significantly over the previous UI-TARS-1.5.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 17:44:45 GMT)
An update to UI-TARS: Pioneering Automated GUI Interaction with Native Agents – Introducing the Latest arXiv Papers, UFO2: The Desktop AgentOS, UI-TARS-1.5 – Introducing the Latest arXiv Papers. The authors claim a large improvement over the previous model: "Empirical evaluation shows that UI-TARS-2 delivers significant improvements over UI-TARS-1.5 [56], achieving strong results in both GUI-based interaction and game environments. On GUI benchmarks, the model reaches 88.2 on Online-Mind2Web [77], 47.5 on OSWorld [75], 50.6 on WindowsAgentArena [10], and 73.3 on AndroidWorld [52], representing clear gains over the previous generation and outperforming strong baselines such as Claude and OpenAI agents in multiple cases." The points below are the stated improvements, and the sense that the team has been doing everything it can since the first version is striking.
- First, to mitigate data scarcity, we design a scalable Data Flywheel that co-evolves the model and its training corpus through continual pretraining, supervised fine-tuning, rejection sampling, and multi-turn RL.
- Second, to overcome the difficulties of scalable multi-turn RL, we design a training framework that stabilizes optimization in long-horizon settings.
- Third, to move beyond the limitations of pure GUI interaction, we construct a hybrid GUI-centered environment that augments on-screen actions with access to complementary resources such as file systems, terminals, and other external tools, enabling agents to solve a broader spectrum of realistic workflows.
- Fourth, to support large-scale training and evaluation, we build a unified sandbox platform capable of orchestrating heterogeneous environments—ranging from cloud VMs for GUI interaction to browser-based sandboxes for games—under a consistent API.
The repository is GitHub – bytedance/UI-TARS

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [16.9]
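We introduce UI-AGILE, which enhances GUI agents in both training and inference. For training, we propose a series of improvements to the supervised fine-tuning (SFT) process. For inference, we propose decomposed grounding with selection, which dramatically improves grounding accuracy on high-resolution displays.
Reference translation of the paper (metadata) (Sat, 09 Aug 2025 17:51:27 GMT)
Proposes a framework that strengthens grounding ability, which has a large effect on GUI agent performance. The authors state: "UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection."
The repository is GitHub – KDEGroup/UI-AGILE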

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows [10.3]
Large language models (LLMs) are increasingly deployed in real-world applications that require complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across a variety of office applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automates the generation of long-horizon workflow benchmarks.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 17:53:03 GMT)
"We introduce OdysseyBench, a comprehensive benchmark for evaluating agents on long-horizon workflows across multiple office applications, consisting of OdysseyBench+ and OdysseyBench-Neo." and "We propose HOMERAGENTS, a multi-agent framework that automates the generation of long-horizon tasks, enabling scalable and diverse benchmark creation." In short, a benchmark proposal that includes a framework for creating the benchmark.
The repository is said to be https://github.com/microsoft/OdysseyBench, but at the time of writing it returns a 404.

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.6]
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality. This survey organizes the current state of OS agent research and offers guidance for both academic research and industrial development.
Reference translation of the paper (metadata) (Wed, 06 Aug 2025 14:33:45 GMT)
A survey that opens with: "The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced."
The repository is OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use (ACL 2025)

CoAct-1: Computer-using Agents with Coding as Actions

CoAct-1: Computer-using Agents with Coding as Actions [95.0]
CoAct-1 is a novel multi-agent system that combines GUI-based control with direct programmatic execution. We evaluate the system on the OSWorld benchmark, where CoAct-1 achieves a state-of-the-art success rate of 60.76%.
Reference translation of the paper (metadata) (Tue, 05 Aug 2025 21:33:36 GMT)
Proposes a GUI agent that makes good use of code generation: "CoAct-1 features an Orchestrator that dynamically delegates subtasks to either a conventional GUI Operator or a specialized Programmer agent, which can write and execute Python or Bash scripts. This hybrid approach allows the agent to bypass inefficient GUI action sequences for tasks like file management and data processing, while still leveraging visual interaction when necessary." It claims SoTA on OSWorld.
The project site is CoAct-1
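
The delegation pattern quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions: the keyword-based `route` heuristic, the `CODE_HINTS` list, and both handler functions are hypothetical stand-ins, whereas the orchestrator described in the quote decides dynamically and the Programmer writes and runs real Python or Bash scripts.

```python
from typing import Callable, Dict, List

# Hypothetical hints that a subtask is better solved with a script than with clicks.
CODE_HINTS = ("file", "rename", "csv", "batch")


def route(description: str) -> str:
    """Pick a worker name for a subtask (toy keyword heuristic, not the paper's method)."""
    text = description.lower()
    return "programmer" if any(hint in text for hint in CODE_HINTS) else "gui_operator"


def programmer(description: str) -> str:
    """Stand-in for an agent that would write and execute a script."""
    return f"[programmer] would run a script for: {description}"


def gui_operator(description: str) -> str:
    """Stand-in for an agent that would act through on-screen clicks and typing."""
    return f"[gui_operator] would click through: {description}"


HANDLERS: Dict[str, Callable[[str], str]] = {
    "programmer": programmer,
    "gui_operator": gui_operator,
}


def orchestrate(subtasks: List[str]) -> List[str]:
    """Delegate each subtask to the worker chosen by route()."""
    return [HANDLERS[route(task)](task) for task in subtasks]


for line in orchestrate([
    "Rename all files in the Downloads folder",
    "Click the Save button in the dialog",
]):
    print(line)
```

The point of the split is that bulk operations such as file management go through one script call, while steps that depend on what is visible stay with the GUI worker.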
