Autonomous Agent – Page 9 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task

P-RAG: Progressive Retrieval Augmented Generation For Planning on Embodied Everyday Task [94.1]
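Embodied Everyday Task is a popular task in the embodied AI community. Natural language instructions often lack explicit task planning. Incorporating knowledge of the task environment into a model requires extensive training.
Reference translation of the paper (metadata) (Tue, 17 Sep 2024 15:29:34 GMT)
Proposes using RAG for agent behavior (planning and so on) given natural language instructions and environment information. The RAG database is updated dynamically, which makes it feel very much like LLM-based agents in general.
Intuitively, retrieval seems like the hard part, yet the paper says "When an agent interacts with the environment during a task, it first receives the environment’s goal instruction I_g and observation O_t. Then it encodes with MiniLM [31] both of them". It is surprising that such a simple approach works.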
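
As a rough illustration of the retrieval step quoted above, here is a minimal sketch of a memory that is queried with the goal instruction plus the current observation and that grows as episodes finish. Everything in it is an assumption made for illustration: `embed` is a toy bag-of-words stand-in for the MiniLM encoder, and `ProgressiveMemory`, its methods, and the sample tasks are hypothetical names, not the paper's implementation.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a stand-in (assumption) for a MiniLM encoder."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class ProgressiveMemory:
    """Hypothetical store that is extended after every finished episode."""

    def __init__(self):
        self.entries = []  # list of (key vector, record)

    def add(self, goal, observation, plan):
        # Goal and observation are encoded together to form the key.
        key = embed(goal + " " + observation)
        record = {"goal": goal, "observation": observation, "plan": plan}
        self.entries.append((key, record))

    def retrieve(self, goal, observation, k=1):
        # The query is built the same way as the keys, then ranked by similarity.
        query = embed(goal + " " + observation)
        ranked = sorted(self.entries, key=lambda e: cosine(query, e[0]), reverse=True)
        return [record for _, record in ranked[:k]]


memory = ProgressiveMemory()
memory.add("put a clean mug on the desk", "you see a sink and a mug",
           ["take mug", "clean mug at sink", "put mug on desk"])
memory.add("heat an egg", "you see a microwave and a fridge",
           ["open fridge", "take egg", "heat egg in microwave"])

new_goal = "put a clean plate on the table"
new_observation = "you see a sink and a plate"
best = memory.retrieve(new_goal, new_observation)[0]

# After the episode, the new experience is written back, so later tasks
# retrieve from a larger database.
memory.add(new_goal, new_observation,
           ["take plate", "clean plate at sink", "put plate on table"])
```

A real system would replace `embed` with a sentence encoder and pass the retrieved plans to the LLM as context; the sketch only shows the encode, retrieve, and write-back cycle.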

Paper Copilot, TravelAgent

Papers that are close to LLM-based applications are also useful for seeing how such systems work internally and how they are designed.

Paper Copilot: A Self-Evolving and Efficient LLM System for Personalized Academic Assistance [14.5]
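This paper introduces Paper Copilot, a self-evolving and efficient LLM system that assists researchers. Paper Copilot provides personalized research services and maintains a database that is updated in real time. The paper details the design and implementation of Paper Copilot and discusses its contribution to personalized academic assistance and its potential to streamline the research process.
Reference translation of the paper (metadata) (Fri, 06 Sep 2024 20:04:04 GMT)
An assistant for checking papers.
The demo system is ArxivCopilot – a Hugging Face Space by ulab-ai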

TravelAgent: An AI Assistant for Personalized Travel Planning [36.0]
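Introduces TravelAgent, a travel planning system powered by large language models (LLMs). TravelAgent consists of four modules: tool usage, recommendation, planning, and memory. The authors evaluate TravelAgent with human and simulated users, demonstrate its overall effectiveness on three criteria, and confirm the accuracy of its personalized recommendations.
Reference translation of the paper (metadata) (Thu, 12 Sep 2024 14:24:45 GMT)
A travel planning agent; the way it is built is a useful reference.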

Agent Workflow Memory

Agent Workflow Memory [71.8]
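This paper introduces Agent Workflow Memory (AWM), a method for inducing commonly reused routines. AWM substantially improves the baseline results, by 24.6% and 51.1% in relative success rate. Online AWM generalizes robustly across cross-task, cross-website, and cross-domain evaluations.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 17:21:00 GMT)
Proposes a framework in which "AWM induces workflows from agent trajectories by extracting reusable routines, and then integrates these workflows into agent memory to guide future task-solving processes." It can be pictured as a dynamic memory that generalizes and accumulates past experience, and it is reported to be effective not only in offline scenarios but also online.
The repository is GitHub – zorazrw/agent-workflow-memory: AWM: Agent Workflow Memory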
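
To make the "induce workflows, then integrate them into memory" idea concrete, here is a deliberately simplified sketch. It is an assumption-heavy toy: workflows are taken to be action pairs that recur across at least two trajectories, which is a frequency heuristic swapped in for illustration and not the induction procedure used in the paper. The function names and action labels are hypothetical.

```python
from collections import Counter


def induce_workflows(trajectories, length=2, min_support=2):
    """Return action n-grams that occur in at least `min_support` trajectories."""
    support = Counter()
    for actions in trajectories:
        # A set is used so each trajectory counts an n-gram at most once.
        ngrams = {tuple(actions[i:i + length])
                  for i in range(len(actions) - length + 1)}
        support.update(ngrams)
    return sorted(ng for ng, count in support.items() if count >= min_support)


def build_prompt(task, workflows):
    """Prepend the stored workflows to a new task description."""
    lines = ["Known workflows:"]
    lines += ["- " + " -> ".join(w) for w in workflows]
    lines.append("Task: " + task)
    return "\n".join(lines)


# Hypothetical past trajectories of a web agent.
trajectories = [
    ["open_site", "click_search", "type_query", "click_result"],
    ["open_site", "click_search", "type_query", "add_to_cart"],
    ["open_site", "click_login", "type_password"],
]

workflow_memory = induce_workflows(trajectories)
prompt = build_prompt("find a red backpack", workflow_memory)
```

In an online setting the same call would be repeated after each new trajectory, so the memory keeps growing while the agent is in use.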

OpenAI o1

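The biggest news last week was surely the release of OpenAI o1 (Introducing OpenAI o1 | OpenAI), which had been the subject of various rumors. It shows particularly strong performance in STEM fields.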

Much of the technical detail has not been disclosed, but Learning to Reason with LLMs | OpenAI states: "Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them." This seems close to the trend toward self-improvement and synthetic data use (self-X – arXiv最新論文の紹介 (devneko.jp), Synthetic data – arXiv最新論文の紹介 (devneko.jp)).

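The Q&A session for developers apparently included some interesting exchanges, such as:

OpenAI o1 is a model, not a system; it is a model that generates a long reasoning process (not shown to the user)
Prompt engineering on GPT-4o cannot match the performance of OpenAI o1
RAG remains effective with OpenAI o1

Details will probably not be disclosed, but some kind of technical report would be welcome. At present it does not deviate much from recent research trends, and judging from the size of the performance gains and the feel of using it, my honest impression is that it is not a big surprise. It remains an open question whether a single model is better, or whether it is better to build a system (agentic behavior) and combine models suited to it (for example, a model specialized for agentic behavior and a model suited to ordinary inference). o1 is said to be the former, but I suspect this will impose significant constraints once external tool use is considered.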

I expect it will be evaluated on various benchmarks going forward, including ones that involve agentic behavior. On Cybench (Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models – arXiv最新論文の紹介 (devneko.jp)), "Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks." has improved (overtaking Claude 3.5 Sonnet, which had beaten GPT-4o), whereas its Success Rate falls short of GPT-4o.
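
The quoted metric definition can be read as follows: compute the share of solved subtasks for each task, then average those shares over tasks. The sketch below contrasts that macro average with a pooled (micro) average; the task names and outcomes are made-up illustration data, and the function names are hypothetical.

```python
def subtasks_percent_solved(results):
    """Macro average: per-task percentage of solved subtasks, averaged over tasks.

    `results` maps a task name to a list of booleans, one per subtask.
    """
    per_task = [100.0 * sum(flags) / len(flags)
                for flags in results.values() if flags]
    return sum(per_task) / len(per_task) if per_task else 0.0


def pooled_percent_solved(results):
    """Micro average: all subtasks pooled together, for comparison."""
    flags = [f for task_flags in results.values() for f in task_flags]
    return 100.0 * sum(flags) / len(flags) if flags else 0.0


# Made-up outcomes: one task with four subtasks, one with a single subtask.
results = {
    "task_a": [True, True, False, False],
    "task_b": [True],
}

macro = subtasks_percent_solved(results)
micro = pooled_percent_solved(results)
```

With macro averaging every task carries the same weight regardless of how many subtasks it has, so the two numbers differ whenever tasks have unequal subtask counts. This is also why a model can improve on this metric without improving its end-to-end success rate.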

SYNTHETIC CONTINUED PRETRAINING

Synthetic continued pretraining [29.7]
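To learn a given fact, a model must be trained on hundreds to thousands of diverse representations of it. This work proposes synthetic continued pretraining, which synthesizes a large corpus that is more amenable to learning. The proposal is instantiated with EntiGraph, a synthetic data augmentation algorithm.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 17:21:59 GMT)
Proposes EntiGraph, which builds synthetic data via a knowledge graph. The paper reports that "Synthetic continued pretraining with EntiGraph demonstrates consistent scaling in downstream closed-book QA performance up to a 600M token synthetic corpus, whereas baselines such as continued pretraining on the small corpus or synthetic paraphrases show no improvement or asymptote early.", which confirms its effectiveness.
Is the right interpretation that going through abstract "knowledge" provides better information (usable for learning) than merely transforming the surface expression?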

Large Language Model-Based Agents for Software Engineering: A Survey

Large Language Model-Based Agents for Software Engineering: A Survey [20.3]
Recent advances in Large Language Models (LLMs) have shaped a new paradigm of AI agents, namely LLM-based agents. The authors collect 106 papers and categorize them from two perspectives: SE and agents. They also discuss open challenges and future directions in this important area.
Reference translation of the paper (metadata) (Wed, 04 Sep 2024 15:59:41 GMT)
A survey of LLM-based agents in software engineering.
There is also a repository: GitHub – FudanSELab/Agent4SE-Paper-List: Repository for the paper “Large Language Model-Based Agents for Software Engineering: A Survey”.

The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers by Zheyuan (Kevin) Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, Tobias Salz :: SSRN
There is also a report that "Though each separate experiment is noisy, combined across all three experiments and 4,867 software developers, our analysis reveals a 26.08% increase (SE: 10.3%) in the number of completed tasks among developers using the AI tool."; I wonder whether the use of AI in software engineering will keep advancing rapidly.
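
To get a feel for how noisy the quoted estimate is, the standard error can be turned into an approximate interval. The sketch below assumes a normal approximation and assumes that the quoted SE is expressed in the same units as the 26.08% estimate; neither assumption is confirmed by the quoted text, and `normal_ci` is a hypothetical helper.

```python
def normal_ci(estimate, std_error, z=1.96):
    """Approximate 95% interval under a normal approximation (assumption)."""
    margin = z * std_error
    return estimate - margin, estimate + margin


# Point estimate and standard error as quoted in the report.
low, high = normal_ci(26.08, 10.3)
```

Under these assumptions the interval runs from roughly 5.9% to 46.3%, so the direction of the effect looks clear while its size remains quite uncertain.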

xLAM: A Family of Large Action Models to Empower AI Agent Systems / ToolACE: Winning the Points of LLM Function Calling

xLAM: A Family of Large Action Models to Empower AI Agent Systems [111.6]
Releases xLAM, a series of large action models designed for AI agent tasks. xLAM delivers exceptional performance across multiple agent ability benchmarks.
Reference translation of the paper (metadata) (Thu, 05 Sep 2024 03:22:22 GMT)
A proposal from Salesforce AI Research of models suited to agent behavior. It makes good use of synthetic data techniques when unifying and augmenting datasets. The source code is under the Apache-2 license. The models are public but under CC-BY-NC, which prohibits commercial use. On performance, the paper says "Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use." It also notes that "The insights we learned from training these models highlight the importance of rigorous data processing and the potential of data synthesis in developing capable AI agents.", which suggests that the use of synthetic data is growing in importance.
The repositories are GitHub – SalesforceAIResearch/xLAM and xLAM models – a Salesforce Collection (huggingface.co)

The following paper on the Berkeley Function-Calling Leaderboard has also been published. It too takes a synthetic data approach.

ToolACE: Winning the Points of LLM Function Calling [139.1]
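ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. The authors demonstrate that a model trained on the synthesized data achieves state-of-the-art performance on the Berkeley Function-Calling Leaderboard with only 8B parameters.
Reference translation of the paper (metadata) (Mon, 02 Sep 2024 03:19:56 GMT)
Tackles the Berkeley Function-Calling Leaderboard through (agentic) data synthesis, using a pipeline that consists of "Tool Self-evolution Synthesis (TSS), Multi-Agent Interactive Dialog Generation (MAI), and Dual-Layer Validation Process (DLV)."
The repository is Team-ACE (Team-ACE) (huggingface.co)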

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models [33.2]
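Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. To assess agent capabilities, seven models are evaluated: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
Reference translation of the paper (metadata) (Thu, 15 Aug 2024 17:23:10 GMT)
A benchmark of whether LLMs can solve tasks drawn from CTF competitions. Without guidance it still looks quite difficult. At the time of viewing the ranking was Claude 3.5 Sonnet > GPT-4o > Claude 3 Opus, and the open Llama 3.1 405B Instruct performed considerably worse than the commercial models.
The repository is Cybench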

Re-Thinking Process Mining in the AI-Based Agents Era

Re-Thinking Process Mining in the AI-Based Agents Era [39.6]
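Large language models (LLMs) have emerged as powerful conversational interfaces, and their application to process mining (PM) tasks has shown promising results. This paper proposes leveraging the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. The authors examine various implementations of AgWf and the types of AI-based tasks.
Reference translation of the paper (metadata) (Wed, 14 Aug 2024 10:14:18 GMT)
Process mining in the LLM era. A trial based on GitHub – crewAIInc/crewAI: Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks. is available at GitHub – fit-alessandro-berti/agents-trial: agents-trial.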

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [32.7]
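The method gathers quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection. The authors show that it is effective for training agents and outperforms traditional reinforcement learning approaches.
Reference translation of the paper (metadata) (Tue, 20 Aug 2024 08:22:04 GMT)
Proposes a method that improves itself by repeating "(1) reflection and idea generation step and (2) the strategy improvement step". It looks effective.
The repository is Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search (llm-strategist.github.io)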
