Autonomous Agent – Page 13 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

Agent Workflow Memory

Agent Workflow Memory [71.8]
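This paper introduces Agent Workflow Memory (AWM), a method for inducing commonly reused routines. AWM substantially improves on baseline results, by 24.6% and 51.1% in relative success rate. Online AWM generalizes robustly across cross-task, cross-website, and cross-domain evaluations.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 17:21:00 GMT)
The paper proposes a framework in which "AWM induces workflows from agent trajectories by extracting reusable routines, and then integrates these workflows into agent memory to guide future task-solving processes." The idea resembles a dynamic memory that generalizes and accumulates past experience, and it is reportedly effective in online as well as offline scenarios.
Repository: GitHub – zorazrw/agent-workflow-memory: AWM: Agent Workflow Memory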
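
As a rough illustration of the quoted idea (induce reusable routines from trajectories, then put them into memory to guide later tasks), here is a minimal sketch. Everything in it is hypothetical: the `WorkflowMemory` class, the action strings, and especially the induction heuristic, which simply keeps the longest common action prefix and stands in for whatever induction method the paper actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class WorkflowMemory:
    """Toy memory of reusable routines (hypothetical, not the AWM codebase)."""

    workflows: dict[str, list[str]] = field(default_factory=dict)

    def induce(self, name: str, trajectories: list[list[str]]) -> list[str]:
        """Keep the longest common action prefix as the reusable routine."""
        prefix: list[str] = []
        for step in zip(*trajectories):
            # Stop at the first position where the trajectories disagree.
            if len(set(step)) != 1:
                break
            prefix.append(step[0])
        if prefix:
            self.workflows[name] = prefix
        return prefix

    def as_prompt(self) -> str:
        """Render stored workflows as text to prepend to an agent prompt."""
        lines: list[str] = []
        for name, steps in self.workflows.items():
            lines.append(f"Workflow: {name}")
            lines.extend(f"  {i}. {s}" for i, s in enumerate(steps, 1))
        return "\n".join(lines)


memory = WorkflowMemory()
routine = memory.induce(
    "search_product",
    [
        ["click(search_box)", "type(query)", "press(Enter)", "click(item_3)"],
        ["click(search_box)", "type(query)", "press(Enter)", "click(item_1)"],
    ],
)
print(memory.as_prompt())
```

In an online setting, `induce` would be called again as new successful trajectories arrive, so the memory grows while the agent works.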

OpenAI o1

Last week's biggest news was probably the release of OpenAI o1 (Introducing OpenAI o1 | OpenAI), which had long been the subject of rumors. It performs especially strongly in STEM fields.

Much of the technical detail is undisclosed, but Learning to Reason with LLMs | OpenAI states: "Our large-scale reinforcement learning algorithm teaches the model how to think productively using its chain of thought in a highly data-efficient training process. We have found that the performance of o1 consistently improves with more reinforcement learning (train-time compute) and with more time spent thinking (test-time compute). The constraints on scaling this approach differ substantially from those of LLM pretraining, and we are continuing to investigate them." This seems close to the self-improvement and synthetic-data line of work (self-X – arXiv最新論文の紹介 (devneko.jp), Synthetic data – arXiv最新論文の紹介 (devneko.jp)).

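In the Q&A for developers, the following points reportedly came up:

OpenAI o1 is a model, not a system; it is a model that generates a long reasoning process (not shown to the user)
Prompt engineering on GPT-4o cannot match the performance of OpenAI o1
RAG is still effective with OpenAI o1

These were among the more interesting exchanges. Full details will probably not be disclosed, but some kind of technical report would be welcome. At this point o1 does not depart far from recent research trends, and judging from the size of the performance gains and the hands-on feel, my honest impression is that there is no big surprise. One open question is whether a single model is the better design, or whether it is better to build a system (agentic behavior) and pair it with models suited to it (for example, one model specialized for agentic behavior and another suited to ordinary inference). o1 is said to be the former, but I suspect that imposes significant constraints once external tool use is considered.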

I expect it will be evaluated on a wide range of benchmarks, including agentic settings. On Cybench (Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models – arXiv最新論文の紹介 (devneko.jp)), "Subtasks % Solved: Percentage of subtasks solved per task, macro-averaged across the tasks." has improved (o1 overtook Claude 3.5 Sonnet, which had been ahead of GPT-4o), while its success rate does not reach that of GPT-4o.
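
The quoted metric definition is easy to misread, so here is a small sketch of what "macro-averaged across the tasks" means, with a pooled (micro) average alongside for contrast. The function names and the example data are made up for illustration and are not taken from the Cybench code.

```python
def subtasks_solved_macro(results: dict[str, list[bool]]) -> float:
    """Per-task fraction of solved subtasks, averaged over tasks (in %)."""
    per_task = [sum(flags) / len(flags) for flags in results.values() if flags]
    if not per_task:
        raise ValueError("no tasks with subtasks")
    return 100.0 * sum(per_task) / len(per_task)


def subtasks_solved_micro(results: dict[str, list[bool]]) -> float:
    """Fraction over all subtasks pooled together (in %), for contrast only."""
    flags = [flag for task in results.values() for flag in task]
    if not flags:
        raise ValueError("no subtasks")
    return 100.0 * sum(flags) / len(flags)


# Hypothetical results: True means the subtask was solved.
example = {
    "task_a": [True, True, False, False],  # 2/4 = 0.50
    "task_b": [True],                      # 1/1 = 1.00
}
macro = subtasks_solved_macro(example)  # (0.50 + 1.00) / 2 -> 75.0
micro = subtasks_solved_micro(example)  # 3/5 -> 60.0
print(f"macro={macro:.1f}% micro={micro:.1f}%")
```

With macro-averaging every task counts equally regardless of how many subtasks it has, which is one reason this number can move differently from the overall success rate.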

Synthetic Continued Pretraining

Synthetic continued pretraining [29.7]
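To learn a given fact, a model must be trained on hundreds to thousands of diverse representations of it. This work proposes synthetic continued pretraining, which synthesizes a large corpus that is more amenable to learning. The proposal is instantiated with EntiGraph, a synthetic data augmentation algorithm.
Reference translation of the paper (metadata) (Wed, 11 Sep 2024 17:21:59 GMT)
A proposal of EntiGraph, which builds synthetic data via a knowledge graph. Its effectiveness is confirmed: "Synthetic continued pretraining with EntiGraph demonstrates consistent scaling in downstream closed-book QA performance up to a 600M token synthetic corpus, whereas baselines such as continued pretraining on the small corpus or synthetic paraphrases show no improvement or asymptote early."
Is the right interpretation that going through abstract "knowledge" provides better information (usable for learning) than merely rephrasing the text?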
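
To make the contrast with plain paraphrasing concrete, here is a minimal sketch of entity-pair prompting: instead of rewording the text, it asks about relations between entities that appear in it. This is only an assumed reading of "via a knowledge graph"; the function `pair_prompts`, the prompt wording, and the hand-supplied entity list are hypothetical and not taken from the EntiGraph implementation, and the call to an LLM that would answer each prompt is left out.

```python
from itertools import combinations


def pair_prompts(document: str, entities: list[str]) -> list[str]:
    """Build one relation prompt per pair of entities found in the document."""
    present = [e for e in entities if e in document]
    return [
        f"Based on the document, explain how {a} relates to {b}.\n\n{document}"
        for a, b in combinations(present, 2)
    ]


doc = "Ada Lovelace wrote notes on the Analytical Engine designed by Charles Babbage."
prompts = pair_prompts(
    doc,
    ["Ada Lovelace", "Analytical Engine", "Charles Babbage", "Alan Turing"],
)
print(len(prompts))  # 3 pairs; "Alan Turing" is not in the document
```

The number of prompts grows quadratically with the number of entities, which gives one intuition for how a small source corpus can be expanded into a much larger synthetic one.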

Large Language Model-Based Agents for Software Engineering: A Survey

Large Language Model-Based Agents for Software Engineering: A Survey [20.3]
Recent advances in Large Language Models (LLMs) have shaped a new paradigm of AI agents, namely LLM-based agents. We collect 106 papers and categorize them from two perspectives: SE and agents. We also discuss open challenges and future directions in this important area.
Reference translation of the paper (metadata) (Wed, 04 Sep 2024 15:59:41 GMT)
A survey of LLM-based agents in software engineering.
A repository is also available: GitHub – FudanSELab/Agent4SE-Paper-List: Repository for the paper “Large Language Model-Based Agents for Software Engineering: A Survey”.

The Effects of Generative AI on High Skilled Work: Evidence from Three Field Experiments with Software Developers by Zheyuan (Kevin) Cui, Mert Demirer, Sonia Jaffe, Leon Musolff, Sida Peng, Tobias Salz :: SSRN
This paper reports that "Though each separate experiment is noisy, combined across all three experiments and 4,867 software developers, our analysis reveals a 26.08% increase (SE: 10.3%) in the number of completed tasks among developers using the AI tool." I wonder whether AI use in software engineering will keep advancing at this pace.
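
To get a feel for how noisy the quoted estimate is, the sketch below turns the reported point estimate and standard error into an approximate 95% interval. It assumes a normal approximation and that the SE is expressed in the same units as the estimate; neither assumption is stated in the quote, so treat the result as a back-of-the-envelope reading rather than the authors' own interval.

```python
def normal_ci(estimate: float, std_err: float, z: float = 1.96) -> tuple[float, float]:
    """Approximate confidence interval under a normal approximation."""
    half_width = z * std_err
    return estimate - half_width, estimate + half_width


# Figures from the quoted abstract: 26.08% increase, SE 10.3%.
low, high = normal_ci(26.08, 10.3)
z_score = 26.08 / 10.3
print(f"95% CI: {low:.1f}% to {high:.1f}%, z = {z_score:.2f}")
# 95% CI: 5.9% to 46.3%, z = 2.53
```

Under these assumptions the effect is distinguishable from zero, but the plausible range is wide, which matches the authors' own remark that each separate experiment is noisy.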

xLAM: A Family of Large Action Models to Empower AI Agent Systems / ToolACE: Winning the Points of LLM Function Calling

xLAM: A Family of Large Action Models to Empower AI Agent Systems [111.6]
We release xLAM, a family of large action models designed for AI agent tasks. xLAM delivers exceptional performance across multiple agent ability benchmarks.
Reference translation of the paper (metadata) (Thu, 05 Sep 2024 03:22:22 GMT)
A proposal from Salesforce AI Research of models suited to agent behavior. It makes good use of synthetic-data techniques in unifying and augmenting datasets. The source code is under the Apache-2 license. The models are public but under CC-BY-NC, which prohibits commercial use. On performance: "Our experimental results demonstrate that xLAM consistently delivers exceptional performance across multiple agent ability benchmarks, notably securing the 1st position on the Berkeley Function-Calling Leaderboard, outperforming GPT-4, Claude-3, and many other models in terms of tool use." The paper also notes that "The insights we learned from training these models highlight the importance of rigorous data processing and the potential of data synthesis in developing capable AI agents.", so synthetic data appears to be growing in importance.
Repositories: GitHub – SalesforceAIResearch/xLAM and xLAM models – a Salesforce Collection (huggingface.co)

The following paper, which also targets the Berkeley Function-Calling Leaderboard, has been published as well. It too takes a synthetic-data approach.

ToolACE: Winning the Points of LLM Function Calling [139.1]
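ToolACE is an automatic agentic pipeline designed to generate accurate, complex, and diverse tool-learning data. We demonstrate that models trained on our synthesized data, even with only 8B parameters, achieve state-of-the-art performance on the Berkeley Function-Calling Leaderboard.
Reference translation of the paper (metadata) (Mon, 02 Sep 2024 03:19:56 GMT)
An approach to the Berkeley Function-Calling Leaderboard based on (agentic) data synthesis, with a pipeline consisting of "Tool Self-evolution Synthesis (TSS), Multi-Agent Interactive Dialog Generation (MAI), and Dual-Layer Validation Process (DLV)."
Repository: Team-ACE (Team-ACE) (huggingface.co)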
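
Synthetic function-calling data is only useful if the generated calls are well formed, so some validation step is needed. The sketch below shows one kind of rule-style check such a step could include: parsing a generated call and comparing it against a tool specification. The spec format, the `validate_call` function, and the `get_weather` tool are all hypothetical; this is not ToolACE's validator, and the summary above does not describe how DLV works internally.

```python
import json

# Hypothetical tool specification used only for this example.
TOOL_SPEC = {
    "name": "get_weather",
    "parameters": {
        "city": {"type": str, "required": True},
        "unit": {"type": str, "required": False},
    },
}


def validate_call(raw: str, spec: dict) -> list[str]:
    """Return a list of problems; an empty list means the call passes."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(call, dict):
        return ["call must be a JSON object"]

    problems = []
    if call.get("name") != spec["name"]:
        problems.append(f"unknown function: {call.get('name')!r}")

    args = call.get("arguments", {})
    if not isinstance(args, dict):
        return problems + ["arguments must be an object"]

    # Check declared parameters in spec order, then look for extras.
    for key, rule in spec["parameters"].items():
        if key not in args:
            if rule["required"]:
                problems.append(f"missing required argument: {key}")
        elif not isinstance(args[key], rule["type"]):
            problems.append(f"wrong type for argument: {key}")
    for key in args:
        if key not in spec["parameters"]:
            problems.append(f"unexpected argument: {key}")
    return problems


ok = validate_call('{"name": "get_weather", "arguments": {"city": "Tokyo"}}', TOOL_SPEC)
bad = validate_call('{"name": "get_weather", "arguments": {"unit": 3}}', TOOL_SPEC)
print(ok)   # []
print(bad)  # ['missing required argument: city', 'wrong type for argument: unit']
```

Samples that fail such checks would be dropped or regenerated before training; semantic checks (does the call actually answer the user's request?) need something beyond rules like these.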

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models

Cybench: A Framework for Evaluating Cybersecurity Capabilities and Risk of Language Models [33.2]
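Cybench is a framework for specifying cybersecurity tasks and evaluating agents on those tasks. To assess agent capabilities, we evaluate seven models: GPT-4o, Claude 3 Opus, Claude 3.5 Sonnet, Mixtral 8x22b Instruct, Gemini 1.5 Pro, Llama 3 70B Chat, and Llama 3.1 405B Instruct.
Reference translation of the paper (metadata) (Thu, 15 Aug 2024 17:23:10 GMT)
A benchmark of whether LLMs can solve tasks drawn from CTF competitions. Without guidance, the tasks still look quite hard. At the time of viewing, the ranking was Claude 3.5 Sonnet > GPT-4o > Claude 3 Opus, and the open Llama 3.1 405B Instruct performed considerably worse than the commercial models.
Repository: Cybench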

Re-Thinking Process Mining in the AI-Based Agents Era

Re-Thinking Process Mining in the AI-Based Agents Era [39.6]
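Large Language Models (LLMs) have emerged as powerful conversational interfaces, and their application to process mining (PM) tasks has shown promising results. This paper proposes leveraging the AI-Based Agents Workflow (AgWf) paradigm to enhance the effectiveness of PM on LLMs. We examine various implementations of AgWf and the types of AI-based tasks.
Reference translation of the paper (metadata) (Wed, 14 Aug 2024 10:14:18 GMT)
Process mining in the LLM era. A trial based on GitHub – crewAIInc/crewAI: Framework for orchestrating role-playing, autonomous AI agents. By fostering collaborative intelligence, CrewAI empowers agents to work together seamlessly, tackling complex tasks. is available at GitHub – fit-alessandro-berti/agents-trial: agents-trial.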

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search

Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search [32.7]
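The method collects quality feedback through self-play simulations with Monte Carlo tree search and LLM-based reflection. We show that the method is effective for training agents that outperform traditional reinforcement learning approaches.
Reference translation of the paper (metadata) (Tue, 20 Aug 2024 08:22:04 GMT)
A proposal for a method that self-improves by repeating "(1) reflection and idea generation step and (2) the strategy improvement step". It looks effective.
Repository: Strategist: Learning Strategic Skills by LLMs via Bi-Level Tree Search (llm-strategist.github.io)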

Automated Design of Agentic Systems

Automated Design of Agentic Systems [5.4]
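We formulate a new research area, Automated Design of Agentic Systems, which aims to automatically create agentic system designs. We show that our algorithm can progressively invent agents with novel designs that greatly outperform state-of-the-art hand-designed agents.
Reference translation of the paper (metadata) (Thu, 15 Aug 2024 21:59:23 GMT)
The paper defines the field as "Automated Design of Agentic Systems (ADAS) involves using a search algorithm to discover agentic systems across a search space that optimize an evaluation function." and proposes a method named Meta Agent Search, which uses an LLM to generate code that combines various building blocks. Its effectiveness is reportedly confirmed.
If the goal is well defined, it is natural to expect that the design of agent systems will also be automated. What matters is the scale of whatever stands in for the goal; when will that reach the same level as the real problems we need to solve? (Possibly sooner than expected.)
Until the above is realized, will we be getting by with tools such as Dify or GitHub – modelscope/agentscope: Start building LLM-empowered multi-agent applications in an easier way. (Very Large-Scale Multi-Agent Simulation in AgentScope – arXiv最新論文の紹介 (devneko.jp))?
Project site: ADAS (shengranhu.com). Repository: GitHub – ShengranHu/ADAS: Automated Design of Agentic Systems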
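
The quoted definition has three ingredients: a search space, a search algorithm, and an evaluation function. The toy loop below wires those together so the structure is visible. It is a sketch under heavy assumptions: the block names, the weights in `evaluate`, and the random mutation in `propose` are all invented, and the random proposer merely stands in for the LLM that writes agent code in Meta Agent Search.

```python
import random

# Hypothetical building blocks that make up the search space.
BLOCKS = ["chain_of_thought", "self_reflection", "debate", "tool_use"]


def propose(archive: dict, rng: random.Random) -> tuple:
    """Stand-in for the LLM proposer: mutate the best design found so far."""
    if not archive:
        return (rng.choice(BLOCKS),)
    best = max(archive, key=archive.get)
    block = rng.choice(BLOCKS)
    if block not in best:
        return best + (block,)   # grow the design by one block
    if len(best) > 1:
        return best[:-1]         # otherwise try a smaller design
    return best


def evaluate(design: tuple) -> float:
    """Toy evaluation function; a real one would run the agent on a benchmark."""
    weights = {
        "chain_of_thought": 0.30,
        "self_reflection": 0.20,
        "debate": 0.10,
        "tool_use": 0.25,
    }
    return sum(weights[b] for b in set(design)) - 0.05 * len(design)


def search(iterations: int = 30, seed: int = 0) -> tuple:
    """Propose, evaluate, and archive designs; return the best (design, score)."""
    rng = random.Random(seed)
    archive: dict = {}
    for _ in range(iterations):
        design = propose(archive, rng)
        archive[design] = evaluate(design)
    best = max(archive, key=archive.get)
    return best, archive[best]


best_design, best_score = search()
print(best_design, round(best_score, 3))
```

The point made in the commentary shows up directly in the code: everything hinges on `evaluate`. The loop can only discover designs that are good with respect to that function, so the search is only as meaningful as the evaluation is close to the real task.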

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents

Diversity Empowers Intelligence: Integrating Expertise of Software Engineering Agents [106.9]
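Large language model (LLM) agents have shown great potential for solving real-world software engineering (SWE) problems. We propose DEI (Diversity Empowered Intelligence), a framework that leverages their unique expertise. Experiments show that a DEI-guided committee of agents can surpass the best individual agent's performance by a large margin.
Reference translation of the paper (metadata) (Tue, 13 Aug 2024 17:50:28 GMT)
Autonomous agents for software development are being researched and developed in many places. This work is from Salesforce and takes the approach that "DEI aims to harness these varied skills to tackle a broader range of problems more effectively with a multi-agent ensemble system and a re-ranking pipeline" (DEI = Diversity Empowers Intelligence).
Once a standard benchmark is established, research, development, and analysis advance quickly.
Repository: Salesforce Research DEI Agents (salesforce-research-dei-agents.github.io)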
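
The "multi-agent ensemble system and a re-ranking pipeline" in the quote can be pictured as: several agents each propose a patch for the same issue, and a scorer picks one. The sketch below shows only that selection step. The agent names, the patches, and `toy_score` are invented for illustration; the real scorer in DEI is not described in the summary above, so nothing here should be read as its actual method.

```python
from typing import Callable


def rerank(candidates: dict[str, str], score: Callable[[str], float]) -> tuple[str, str]:
    """Return (agent_name, patch) with the highest score.

    Ties go to the agent that appears first in the dict.
    """
    if not candidates:
        raise ValueError("no candidate patches")
    agent = max(candidates, key=lambda name: score(candidates[name]))
    return agent, candidates[agent]


def toy_score(patch: str) -> float:
    """Hypothetical scorer: reward patches that add a test, penalize length."""
    return (1.0 if "def test_" in patch else 0.0) - 0.001 * len(patch)


# Hypothetical candidate patches from two agents for the same issue.
patches = {
    "agent_a": "fix: handle None\n",
    "agent_b": "fix: handle None\ndef test_none(): ...\n",
}
winner, patch = rerank(patches, toy_score)
print(winner)  # agent_b
```

An ensemble like this can only beat the best single agent if different agents solve different issues and the scorer is good enough to tell which candidate is right, which is where the "diversity" in the title comes in.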
