Autonomous Agent – Page 4 – arXiv最新論文の紹介

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching

Wizard of Shopping: Target-Oriented E-commerce Dialogue Generation with Decision Tree Branching [39.5]
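The goal of conversational product search (CPS) is to develop an intelligent chat-based shopping assistant. This paper proposes TRACER, a new method that uses large language models (LLMs) to generate realistic, natural conversations.
Reference translation of the paper (metadata) (Mon, 03 Feb 2025 00:27:13 GMT)
A proposal for a dialogue generation method: "We leverage decision tree to explore the vast product search space, and construct a dialogue plan that minimizes the number of search steps required to retrieve a relevant product."
Going through a tree structure instead of generating directly may be close in spirit to Condor: Enhance LLM Alignment with Knowledge-Driven Data Synthesis and Refinement – arXiv最新論文の紹介.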
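The quoted idea of planning a dialogue that reaches a product in few search steps can be sketched as a greedy attribute selection. This is only an illustrative sketch, not TRACER itself: the names `entropy`, `best_attribute`, and `plan_dialogue`, the toy catalog, and the choice of entropy as the splitting criterion are all assumptions made for the example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (bits) of a list of attribute values."""
    total = len(values)
    counts = Counter(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def best_attribute(products, attributes):
    """Pick the attribute whose values split the candidates most evenly (assumed criterion)."""
    return max(attributes, key=lambda a: entropy([p[a] for p in products]))


def plan_dialogue(products, target, attributes):
    """Greedily build a list of (attribute, value) questions that narrows down to the target."""
    candidates = list(products)
    remaining = list(attributes)
    plan = []
    while len(candidates) > 1 and remaining:
        attr = best_attribute(candidates, remaining)
        remaining.remove(attr)
        value = target[attr]
        # Keep only the products consistent with the target's answer.
        candidates = [p for p in candidates if p[attr] == value]
        plan.append((attr, value))
    return plan, candidates


if __name__ == "__main__":
    # Hypothetical toy catalog, for illustration only.
    catalog = [
        {"name": "A", "color": "red", "size": "S", "brand": "x"},
        {"name": "B", "color": "red", "size": "M", "brand": "x"},
        {"name": "C", "color": "blue", "size": "S", "brand": "x"},
        {"name": "D", "color": "blue", "size": "M", "brand": "x"},
    ]
    plan, found = plan_dialogue(catalog, catalog[2], ["brand", "color", "size"])
    print(plan)   # [('color', 'blue'), ('size', 'S')]
    print(found)  # the single remaining product, "C"
```

In this toy case the uninformative `brand` attribute is never asked about, so two questions are enough to isolate one of four products. An LLM would then verbalize each planned step as a natural dialogue turn.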

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation

CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation [70.3]
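CowPilot is a framework that supports both autonomous and human-agent collaborative web navigation. The agent proposes next steps, which reduces the number of steps humans must perform, while users can still pause, reject, or take alternative actions. CowPilot serves as a useful tool for data collection and agent evaluation across websites.
Reference translation of the paper (metadata) (Tue, 28 Jan 2025 00:56:53 GMT)
A proposal for a framework premised on humans and agents working together. The result, "We conducted case studies on five common websites and found that the human-agent collaborative mode achieves the highest success rate of 95% while requiring humans to perform only 15.2% of the total steps.", looks likely to translate into realistic efficiency gains. (That said, note that for many tasks full automation and collaborative automation mean very different things.)
The project site is CowPilot: A Framework for Autonomous and Human-Agent Collaborative Web Navigation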
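To make the two quoted numbers concrete (success rate and the share of steps performed by humans), here is a minimal sketch of how such metrics could be computed from episode logs. The `Episode` record and the `summarize` function are hypothetical, and pooling steps over all episodes is an assumption; the paper may aggregate per task instead.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """Hypothetical log record for one web-navigation task."""
    human_steps: int
    agent_steps: int
    success: bool


def summarize(episodes):
    """Return the success rate and the pooled share of steps performed by humans."""
    if not episodes:
        raise ValueError("need at least one episode")
    total_steps = sum(e.human_steps + e.agent_steps for e in episodes)
    if total_steps == 0:
        raise ValueError("episodes contain no steps")
    human_steps = sum(e.human_steps for e in episodes)
    successes = sum(1 for e in episodes if e.success)
    return {
        "success_rate": successes / len(episodes),
        "human_step_share": human_steps / total_steps,
    }


if __name__ == "__main__":
    # Made-up logs, for illustration only.
    logs = [Episode(1, 9, True), Episode(3, 7, False)]
    print(summarize(logs))
```

A low human step share only indicates a real efficiency gain when the success rate stays high, which is why both numbers are reported together.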

UI-TARS: Pioneering Automated GUI Interaction with Native Agents

UI-TARS: Pioneering Automated GUI Interaction with Native Agents [58.2]
This paper introduces UI-TARS, a native model for GUI agents. On the OSWorld benchmark, UI-TARS achieves scores of 24.6 with 50 steps and 22.7 with 15 steps, outperforming Claude (22.0 and 14.9, respectively).
Reference translation of the paper (metadata) (Tue, 21 Jan 2025 17:48:10 GMT)
A proposal of UI-TARS, a GUI agent, claiming SOTA on a variety of tasks. "UI-TARS incorporates several key innovations: (1) Enhanced Perception: leveraging a large-scale dataset of GUI screenshots for context-aware understanding of UI elements and precise captioning; (2) Unified Action Modeling, which standardizes actions into a unified space across platforms and achieves precise grounding and interaction through large-scale action traces; (3) System-2 Reasoning, which incorporates deliberate reasoning into multi-step decision making, involving multiple reasoning patterns such as task decomposition, reflection thinking, milestone recognition, etc. (4) Iterative Training with Reflective Online Traces, which addresses the data bottleneck by automatically collecting, filtering, and reflectively refining new interaction traces on hundreds of virtual machines." It gives a strong impression of having packed in everything that could be done.
The repository is GitHub – bytedance/UI-TARS

PaSa: An LLM Agent for Comprehensive Academic Paper Search

PaSa: An LLM Agent for Comprehensive Academic Paper Search [9.7]
PaSa is an advanced paper search agent powered by large language models. PaSa is optimized using reinforcement learning with AutoScholarQuery, a synthetic dataset. Despite being trained on synthetic data, PaSa significantly outperforms existing baselines on RealScholarQuery.
Reference translation of the paper (metadata) (Fri, 17 Jan 2025 11:12:28 GMT)
An agent that gathers paper information: "PaSa can autonomously make a series of decisions, including invoking search tools, reading papers, and selecting relevant references, to ultimately obtain comprehensive and accurate results for complex scholarly queries."
Building AutoScholarQuery as a benchmark is a distinctive point, and "Although PaSa is trained solely on synthetic data, it achieves remarkable real-world performance." is somewhat surprising.

WebWalker: Benchmarking LLMs in Web Traversal

WebWalker: Benchmarking LLMs in Web Traversal [55.4]
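WebWalkerQA is a benchmark for evaluating the ability of LLMs to perform web traversal. The paper also proposes WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm.
Reference translation of the paper (metadata) (Mon, 13 Jan 2025 18:58:07 GMT)
A proposal of WebWalkerQA, a benchmark of whether a model can obtain the information it needs while moving through a website ("It evaluates the capacity of LLMs to traverse a website’s subpages to extract high-quality data systematically."), together with WebWalker, a multi-agent framework for solving it. The dataset remains hard to solve even with agentic behavior and frontier models such as GPT-4o. (Somewhat unexpected.)
The project site is WebWalker, the repository is GitHub – Alibaba-NLP/WebWalker: 🌐 WebWaker: Benchmarking LLMs in Web Traversal, and there is also WebWalkerQALeaderboard – a Hugging Face Space by callanwu.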

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Search-o1: Agentic Search-Enhanced Large Reasoning Models [24.2]
Large reasoning models (LRMs) such as OpenAI-o1 have demonstrated impressive stepwise reasoning capabilities through large-scale reinforcement learning. The paper introduces Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module.
Reference translation of the paper (metadata) (Thu, 09 Jan 2025 16:48:17 GMT)
A proposal for a RAG + Large Reasoning Model framework. It could be seen as an agentic approach, but with "(a) Direct reasoning without retrieval often results in inaccuracies due to missing knowledge. (b) Our agentic retrieval-augmented reasoning approach improves knowledge access but usually returns lengthy, redundant documents, disrupting coherent reasoning. (c) Our Search-o1 integrates concise and accurate retrieved knowledge seamlessly into the reasoning process, enabling precise and coherent problem-solving." the authors argue for the effectiveness of using Reason-in-Documents, as a process separate from the LRM, to select and summarize information that fits the flow of reasoning before incorporating it into the LRM.
The repository is Search-o1: Agentic Search-Enhanced Large Reasoning Models

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis

OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis [55.4]
The paper proposes OS-Genesis, a new data synthesis pipeline for graphical user interface (GUI) agents. Instead of relying on predefined tasks, OS-Genesis lets agents first perceive the environment and perform stepwise interactions. A trajectory reward model is then used to ensure the quality of the generated trajectories.
Reference translation of the paper (metadata) (Fri, 27 Dec 2024 16:21:58 GMT)
A proposal for a synthetic data construction method for GUI agent development, an area where research is advancing rapidly. The base data is built as follows: "OS-Genesis begins by exploring the functionality of GUI environments through traversing interactive UI elements with actions (e.g., CLICK). This forms the basis for reverse task synthesis, where observed states and actions are retroactively transformed into low-level instructions. These low-level instructions are then derived into high-level instructions, which can seed the collection of GUI trajectories." Quality is then ensured by a Trajectory Reward Model: "Built upon GPT-4o, TRM aims to perform a graded evaluation with a reward score R ∈ [1, 5] to assist in sampling for training.", or so the paper says...
The repository is OS-Genesis: Automating GUI Agent Trajectory Construction via Reverse Task Synthesis
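As a rough illustration of how a graded reward score R ∈ [1, 5] could "assist in sampling for training", the sketch below draws trajectories with probability proportional to their score. The quoted text does not give the sampling rule, so proportional weighting is an assumption, and `sample_by_reward` is a hypothetical helper rather than anything from the OS-Genesis repository.

```python
import random


def sample_by_reward(trajectories, rewards, k, seed=None):
    """Draw k trajectories (with replacement), weighted by their reward scores.

    Assumption: sampling probability is proportional to the reward in [1, 5].
    """
    if len(trajectories) != len(rewards):
        raise ValueError("trajectories and rewards must have the same length")
    if any(not 1 <= r <= 5 for r in rewards):
        raise ValueError("rewards must lie in [1, 5]")
    rng = random.Random(seed)  # local generator so results are reproducible per seed
    return rng.choices(trajectories, weights=rewards, k=k)


if __name__ == "__main__":
    # Made-up trajectory names and scores, for illustration only.
    pool = ["traj_a", "traj_b", "traj_c"]
    scores = [5, 3, 1]
    print(sample_by_reward(pool, scores, k=4, seed=0))
```

Unlike a hard threshold that discards low-scoring trajectories, graded weighting keeps imperfect trajectories in the training mix while still favoring the better ones.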

Training Software Engineering Agents and Verifiers with SWE-Gym

Training Software Engineering Agents and Verifiers with SWE-Gym [89.6]
SWE-Gym is the first environment for training real-world software engineering (SWE) agents. SWE-Gym contains 2,438 real-world Python task instances.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 18:15:39 GMT)
A proposal for an environment for developing software engineering agents, along with the development of high-performing agents. This comes right after seeing o3's overwhelming results, but as "Through extensive experiments, we demonstrate that SWE-Gym enables both agent and verifier models to achieve significant improvements in resolving complex software tasks. Our findings highlight the scalability of these approaches, revealing potential for continuous performance gains with increased compute." shows, agentic behavior is highly effective.
The repository is GitHub – SWE-Gym/SWE-Gym

ResearchTown: Simulator of Human Research Community

ResearchTown: Simulator of Human Research Community [14.0]
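ResearchTown is a multi-agent framework for research community simulation. ResearchTown provides a realistic simulation of collaborative research activities. ResearchTown can maintain robust simulation with multiple researchers and diverse papers.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 18:26:53 GMT)
Multi-agent frameworks are in vogue, but now we have finally arrived at a Town...
Very curious about what happens when the graph structure is changed.
The repository is GitHub – ulab-uiuc/research-town: A platform for developers to simulate research community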

PC Agent: While You Sleep, AI Works — A Cognitive Journey into Digital World

PC Agent: While You Sleep, AI Works — A Cognitive Journey into Digital World [19.0]
PC Agent is an AI system that marks an important step toward this vision through human cognition transfer. To validate this hypothesis, the authors introduce three key innovations. Preliminary experiments in PowerPoint presentation creation show that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 14:02:12 GMT)
A proposal of a method that, "trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications." It comes across as highly efficient. The paper states "In conclusion, we presented a cognition transfer framework that efficiently guides AI to the digital world through three key components: PC Tracker for collecting human-computer interaction data, a two-stage post-processing for cognition completion, and a multi-agent system for computer task automation.", but in real-world deployment various troubles seem likely to arise around PC Tracker. Who should this kind of operation data belong to?
The repository is PC Agent: While You Sleep, AI Works – A Cognitive Journey into Digital World
