Autonomous Agent – Page 7 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs

OpenScholar: Synthesizing Scientific Literature with Retrieval-augmented LMs [151.8]
We introduce OpenScholar, a specialized retrieval-augmented LM that answers scientific queries by identifying relevant passages from 45 million open-access papers and synthesizing citation-backed responses. On ScholarQABench, OpenScholar-8B outperforms GPT-4o by 5% and PaperQA2 by 7%. OpenScholar's datastore, retriever, and self-feedback inference loop also improve off-the-shelf LMs.
Reference translation of the paper (metadata) (Thu, 21 Nov 2024 15:07:42 GMT)
A proposal for a system that answers scientific queries. The description "OPENSCHOLAR consists of a specialized datastore, retrievers and LMs and iteratively improves responses using self-feedback inference with retrieval." gives a strong impression that the authors built every piece. They also constructed a benchmark and claim to exceed GPT-4o: "OPENSCHOLAR using our trained 8B and GPT4o achieves a 51% and 70% win rate against human-generated answers."
Many resources have been released, including the blog (Ai2 OpenScholar: Scientific literature synthesis with retrieval-augmented language models | Ai2), the code (GitHub – AkariAsai/ScholarQABench: This repository contains ScholarQABench data and evaluation pipeline.), and a demo (Ai2 OpenScholar).
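
To make the quoted idea of "self-feedback inference with retrieval" concrete, here is a minimal sketch of such a loop. It is not OpenScholar's implementation: the toy datastore, the word-overlap retriever, and the names `retrieve`, `answer_with_self_feedback`, `stub_generate`, and `stub_critique` are all hypothetical, and the two stubs stand in for real LM calls.

```python
from typing import Callable, List, Optional

# Toy datastore standing in for the real corpus (hypothetical contents).
PASSAGES = [
    "Retrieval-augmented LMs ground answers in retrieved passages.",
    "Self-feedback lets a model critique and revise its own draft.",
    "Citations let readers verify each claim against its source.",
]


def retrieve(query: str, k: int = 2) -> List[str]:
    """Hypothetical retriever: rank passages by word overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(PASSAGES, key=lambda p: -len(words & set(p.lower().split())))
    return ranked[:k]


def answer_with_self_feedback(
    query: str,
    generate: Callable[[str, List[str]], str],
    critique: Callable[[str], Optional[str]],
    max_rounds: int = 3,
) -> str:
    """Draft an answer, then revise it while the critic still has feedback."""
    passages = retrieve(query)
    draft = generate(query, passages)
    for _ in range(max_rounds):
        feedback = critique(draft)
        if feedback is None:  # the critic is satisfied
            break
        # Use the feedback as an extra retrieval query, then regenerate.
        for passage in retrieve(feedback):
            if passage not in passages:
                passages.append(passage)
        draft = generate(query, passages)
    return draft


def stub_generate(query: str, passages: List[str]) -> str:
    """Stand-in for an LM call: echo each passage with a citation marker."""
    return " ".join(f"{p} [{i + 1}]" for i, p in enumerate(passages))


def stub_critique(draft: str) -> Optional[str]:
    """Stand-in for LM self-feedback: ask for citation evidence if missing."""
    return None if "Citations" in draft else "citations verify claim"


result = answer_with_self_feedback("self-feedback retrieval", stub_generate, stub_critique)
```

In this toy run the first draft lacks the citation passage, so the feedback triggers one more retrieval and the revised draft cites three passages. A real system would replace both stubs with model calls and the retriever with a dense index.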

Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge

Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge [47.7]
CHAIC is an inclusive embodied social intelligence challenge designed to test social perception and cooperation in embodied agents. In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints.
Reference translation of the paper (metadata) (Mon, 04 Nov 2024 04:41:12 GMT)
A proposal for a task and benchmark described as follows: "In CHAIC, the goal is for an embodied agent equipped with egocentric observations to assist a human who may be operating under physical constraints—e.g., unable to reach high places or confined to a wheelchair—in performing common household or outdoor tasks as efficiently as possible." That a challenge like this has become realistic shows how quickly AI is advancing.
Repository: GitHub – UMass-Foundation-Model/CHAIC: [NeurIPS D&B Track 2024] Source code for the paper “Constrained Human-AI Cooperation: An Inclusive Embodied Social Intelligence Challenge”

WorkflowLLM

WorkflowLLM: Enhancing Workflow Orchestration Capability of Large Language Models [105.5]
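We propose WorkflowLLM, a data-centric framework for enhancing the capability of large language models in workflow orchestration. It first constructs WorkflowBench, a large-scale fine-tuning dataset with 106,763 samples, covering 1,503 APIs from 83 applications across 28 categories. WorkflowLlama shows the ability to orchestrate complex APIs while achieving strong generalization performance.
Reference translation of the paper (metadata) (Fri, 08 Nov 2024 09:58:02 GMT)
A proposal for a benchmark on workflow generation, which is important in agent development, together with the construction of an LLM.
The procedure is a common one that relies on synthetic data: (1) Data Collection, (2) Query Expansion, and (3) Workflow Generation, followed by creating WorkflowBench from synthetic data and building WorkflowLlama by fine-tuning. Even so, it is interesting that the result completely outperforms GPT-4o w/ ICL.
Repository: GitHub – OpenBMB/WorkflowLLM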

GUI Agents with Foundation Models: A Comprehensive Survey

GUI Agents with Foundation Models: A Comprehensive Survey [53.0]
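This survey consolidates recent research on (M)LLM-based GUI agents and highlights key innovations in data, frameworks, and applications. We hope this paper will encourage further developments in the field of (M)LLM-based GUI agents.
Reference translation of the paper (metadata) (Thu, 07 Nov 2024 17:28:10 GMT)
A survey of MLLM-based GUI agents.
Just as the research seems to be gaining momentum, a survey is published; that pace reflects the current state of this field.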

DynaSaur: Large Language Agents Beyond Predefined Actions

DynaSaur: Large Language Agents Beyond Predefined Actions [108.8]
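Existing LLM agent systems typically select actions from a fixed and predefined set at every step. We propose an LLM agent framework that enables the dynamic creation and composition of actions in an online manner. Experiments on the GAIA benchmark confirm that this framework offers greater flexibility and outperforms previous methods.
Reference translation of the paper (metadata) (Mon, 04 Nov 2024 02:08:59 GMT)
A proposal for a framework that gains flexibility by expressing each stage of agentic behavior as Python code and relying on code generation. "We have explored an LLM agent framework that implements its own actions as Python functions to interact with the world and accumulate its generated actions over time, thus growing a toolset of actions for problem-solving in future tasks." It achieves high performance on the GAIA Leaderboard – a Hugging Face Space by gaia-benchmark.
Repository: GitHub – adobe-research/dynasaur: Official repository for “DynaSaur: Large Language Agents Beyond Predefined Actions” (the code does not appear to have been uploaded at the time of writing)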
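
The idea of an agent that "implements its own actions as Python functions" and accumulates them can be sketched as below. This is an illustration under stated assumptions, not the DynaSaur code: `ActionLibrary` and its methods are hypothetical names, and the two source strings are hard-coded stand-ins for what an LLM would generate.

```python
from typing import Any, Callable, Dict


class ActionLibrary:
    """Hypothetical store of generated actions, kept across tasks."""

    def __init__(self) -> None:
        self.actions: Dict[str, Callable[..., Any]] = {}

    def add_from_source(self, name: str, source: str) -> None:
        """Define a new action from generated source code.

        Earlier actions are placed in the namespace so that a new action
        can compose them. exec() on model output is unsafe outside a sandbox.
        """
        namespace: Dict[str, Any] = dict(self.actions)
        exec(source, namespace)
        self.actions[name] = namespace[name]

    def call(self, name: str, *args: Any) -> Any:
        """Run a stored action by name."""
        return self.actions[name](*args)


library = ActionLibrary()

# Step 1: a generated action (stand-in for LLM output).
library.add_from_source(
    "word_count",
    "def word_count(text):\n    return len(text.split())\n",
)

# Step 2: a later action that reuses the earlier one.
library.add_from_source(
    "average_word_length",
    "def average_word_length(text):\n"
    "    return len(text.replace(' ', '')) / word_count(text)\n",
)
```

Calling `library.call("average_word_length", "ab cd")` runs the second action, which in turn calls the first. The point of the sketch is only that the toolset grows over time and later actions can build on earlier ones.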

Agent K

Large Language Models Orchestrating Structured Reasoning Achieve Kaggle Grandmaster Level [73.1]
We introduce Agent K v1.0, an end-to-end autonomous data science agent. It manages the entire data science life cycle by learning from experience. It optimizes long- and short-term memory by selectively storing and retrieving key information.
Reference translation of the paper (metadata) (Tue, 05 Nov 2024 23:55:23 GMT)
A proposal for an agent system that claims performance on par with a Kaggle Grandmaster: "our results indicate that Agent K v1.0 has reached a performance level equivalent to Kaggle Grandmaster, with a record of 6 gold medals, 3 silver medals, and 7 bronze medals".
There is much to learn from the pipeline design, the prompts, and so on. Still, some "is that really so?" doubts remain, given the statement "However, because this assessment relies on a custom split of the training data rather than the competition’s actual private test set, it remains uncertain whether an agent’s high ranking in this context would align with results on the original Kaggle leaderboard." and the possibility of leakage.

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions

AutoKaggle: A Multi-Agent Framework for Autonomous Data Science Competitions [47.7]
AutoKaggle implements an iterative development process that combines code execution and unit testing to ensure code correctness and logical consistency. A general-purpose data science toolkit, comprising validated functions for data cleaning, feature engineering, and modeling, forms the foundation of this solution. AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines.
Reference translation of the paper (metadata) (Sun, 27 Oct 2024 12:44:25 GMT)
Automation of Kaggle-style data analysis. The targeted tasks (analysis phases) are "background understanding, preliminary exploratory data analysis, data cleaning (DC), in-depth exploratory data analysis, feature engineering (FE), and model building, validation, and prediction (MBVP).", which is broader than typical AutoML; the target data appears to be tabular.
The authors do take care about leakage, stating "As our analysis relies on GPT-4o, which is trained on data available until October 2023, it includes most of the Classic Kaggle competitions. To evaluate the generalization capabilities of AutoKaggle, we therefore focus on competitions initiated after 2024." Even so, the flat assertion "Evaluation results demonstrate that AutoKaggle achieves a validation submission rate of 0.85 and a comprehensive score of 0.82 in typical data science pipelines, fully proving its effectiveness and practicality in handling complex data science tasks." is bold. That said, given the performance of current LLMs, my sense is that these are problems that could be solved with an appropriately built pipeline.
Repository: GitHub – multimodal-art-projection/AutoKaggle
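
The "iterative development process that combines code execution and unit testing" can be illustrated with a small loop like the following. This is only a sketch and not AutoKaggle's code: `CANDIDATES`, `run_unit_tests`, and `develop` are hypothetical, and the fixed list of candidate sources stands in for successive LLM revisions of a data-cleaning function.

```python
from typing import Any, Callable, Dict, List, Tuple

# Stand-ins for successive LLM revisions of a data-cleaning function.
CANDIDATES = [
    # First attempt: wrongly drops missing values instead of filling them.
    "def fill_missing(values):\n"
    "    return [v for v in values if v is not None]\n",
    # Revision: fills missing values with the mean of the present ones.
    "def fill_missing(values):\n"
    "    present = [v for v in values if v is not None]\n"
    "    mean = sum(present) / len(present)\n"
    "    return [mean if v is None else v for v in values]\n",
]


def run_unit_tests(fill_missing: Callable[[list], list]) -> List[str]:
    """Return a description of every failed check (empty list means pass)."""
    failures = []
    if fill_missing([5.0]) != [5.0]:
        failures.append("changed a list with no missing values")
    if fill_missing([1.0, None, 3.0]) != [1.0, 2.0, 3.0]:
        failures.append("did not fill None with the mean")
    return failures


def develop(candidates: List[str]) -> Tuple[int, Callable[[list], list]]:
    """Execute each candidate and keep the first one that passes the tests."""
    for attempt, source in enumerate(candidates, start=1):
        namespace: Dict[str, Any] = {}
        exec(source, namespace)  # unsafe outside a sandbox
        failures = run_unit_tests(namespace["fill_missing"])
        if not failures:
            return attempt, namespace["fill_missing"]
        # A real system would feed `failures` back to the model here.
    raise RuntimeError("no candidate passed the unit tests")


attempts, fill_missing = develop(CANDIDATES)
```

Here the first candidate fails the mean-filling check and the second passes, so `develop` accepts the revision on the second attempt. The unit tests act as the gate that decides when a phase is allowed to finish.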

Evaluating Cultural and Social Awareness of LLM Web Agents

Evaluating Cultural and Social Awareness of LLM Web Agents [113.5]
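CASA is a benchmark for evaluating the sensitivity of large language models to cultural and social norms. The proposed approach evaluates the ability of LLM agents to detect and appropriately respond to norm-violating user queries and observations. Experiments show that current LLMs perform significantly better in non-agent environments than in web-based agent settings.
Reference translation of the paper (metadata) (Wed, 30 Oct 2024 17:35:44 GMT)
A proposal and evaluation of a benchmark that measures whether agents can take cultural and social aspects into account, for example: "(1) Can LLM agents detect and appropriately respond to user queries that violate cultural or social norms, such as searching for a wine gift in Iran, where it is culturally inappropriate?" The result is somewhat surprising: "Specifically, LLMs perform considerably better in non-agent environments compared to web-based agent settings."
This shows that care is needed when designing agents.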

Claude 3.5 Sonnet, Haiku, Computer use, Aya Expanse

The biggest news last week was Anthropic's upgrade of Claude 3.5 Sonnet and its announcement of an agent that operates a PC (GUI).

Introducing computer use, a new Claude 3.5 Sonnet, and Claude 3.5 Haiku \ Anthropic

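For the former, it is notable that the model was not named Opus; if an even more accurate model is being prepared, expectations are high. For the latter, opinion is divided on whether a GUI-driven approach, as in Agent S: An Open Agentic Framework that Uses Computers Like a Human – arXiv最新論文の紹介, or an approach that goes through an API (code), as in OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused) – arXiv最新論文の紹介, is better. Either way, this kind of progress deserves close attention.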

Aya, the multilingual model from Cohere, also deserves attention. Aya Expanse: Connecting Our World

The models are claimed to outperform Gemma, Llama, and Mistral, and are released under CC-BY-NC. CohereForAI/aya-expanse-8b · Hugging Face, CohereForAI/aya-expanse-32b · Hugging Face

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance

Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance [95.0]
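We tackle the challenge of developing proactive agents that can anticipate and initiate tasks without human instructions. We first collect real-world human activities and generate proactive task predictions. These predictions are labeled by human annotators as either accepted or rejected. The labeled data is used to train a reward model that simulates human judgment.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 08:24:09 GMT)
Development of an agent that acts without instructions, in the setting "we investigate a new scenario where the agent autonomously predicts tasks users might assign, aiming to offer assistance proactively". The authors build a benchmark called ProactiveBench and evaluate on it. Fine-tuning appears to be very effective; perhaps that is due to the particular nature of the task.
Repository: GitHub – thunlp/ProactiveAgent: A LLM-based Agent that predict its tasks proactively.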
