June 19, 2025 – Introducing the latest arXiv papers

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Agents of Change: Self-Evolving LLM Agents for Strategic Planning [17.7]
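We benchmark the progression of LLM-based agents, from a simple game-playing agent to systems that can autonomously rewrite their own prompts and the player agent's code. The results show that self-evolving agents, particularly those driven by models such as Claude 3.7 and GPT-4o, outperform static baselines by autonomously adapting their strategies.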
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 05:45:24 GMT)
Proposes and validates a Self-Evolving Agent Framework, using The Settlers of Catan as the testbed.
The paper reports: "Through extensive experiments, we show that agents capable of prompt and code evolution achieve consistently higher performance than static baselines. The PromptEvolver, in particular, outperforms fixed agents across key metrics, and its gains are amplified when paired with stronger base models, seen in Claude 3.7’s 95% improvement from the BaseAgent". The PromptEvolver includes an "Evolver Agent: Provided with access to game results, evolution history, and tools to search the web, view local files, and edit the Player Agent’s prompt."
Self-improvement outside the weights, where the reasoning ability itself resides, also appears to be quite effective when applied to prompts and code. (If we accept that ICL works, could this be said to improve reasoning ability to some degree...?)
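
As a rough illustration of the loop described in the quote (play games, record results in an evolution history, edit the Player Agent's prompt), here is a minimal sketch. It is not the paper's implementation: `play_games` and `propose_edit` are hypothetical stand-ins for the LLM-driven Player Agent and Evolver Agent, the strategy hints and win probabilities are invented for the toy simulation, and the "keep the best prompt so far" rule is an assumption made for simplicity.

```python
import random
from dataclasses import dataclass

# Invented strategy hints; a real Evolver Agent would write free-form text.
HINTS = ("prioritize ore", "build settlements early", "trade actively")


@dataclass
class EvolutionRecord:
    prompt: str
    win_rate: float


def play_games(prompt: str, n_games: int, rng: random.Random) -> float:
    """Hypothetical stand-in for running the Player Agent in Catan.

    Each hint present in the prompt raises the simulated win probability.
    """
    p_win = 0.2 + 0.2 * sum(h in prompt for h in HINTS)
    wins = sum(rng.random() < p_win for _ in range(n_games))
    return wins / n_games


def propose_edit(prompt: str, history: list[EvolutionRecord],
                 rng: random.Random) -> str:
    """Hypothetical stand-in for the Evolver Agent's prompt edit.

    Appends one unused hint, preferring prompts not yet in the history.
    """
    candidates = [f"{prompt} {h}." for h in HINTS if h not in prompt]
    if not candidates:
        return prompt  # nothing left to add in this toy setting
    tried = {record.prompt for record in history}
    fresh = [c for c in candidates if c not in tried]
    return rng.choice(fresh or candidates)


def evolve(base_prompt: str, rounds: int = 5, n_games: int = 200,
           seed: int = 0) -> tuple[EvolutionRecord, list[EvolutionRecord]]:
    """Run the play -> record -> edit loop and return the best record."""
    rng = random.Random(seed)
    history = [EvolutionRecord(base_prompt, play_games(base_prompt, n_games, rng))]
    best = history[0]
    for _ in range(rounds):
        candidate = propose_edit(best.prompt, history, rng)
        record = EvolutionRecord(candidate, play_games(candidate, n_games, rng))
        history.append(record)
        if record.win_rate > best.win_rate:
            best = record  # keep an edit only if it improved the win rate
    return best, history
```

Calling `evolve("You are a Catan player.")` returns the best-scoring prompt together with the full history, which is the information the quoted Evolver Agent is said to have access to when deciding on its next edit.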

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills

Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills [57.7]
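To address the problem of insufficient knowledge, this paper proposes a Hierarchical Multimodal Skills (HMS) module. It progressively abstracts trajectories into execution skills, core skills, and ultimately meta-skills, providing a hierarchical knowledge structure for long-horizon task planning. To bridge the domain gap, it proposes the Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm.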
Reference translation of the paper (metadata) (Thu, 12 Jun 2025 06:21:19 GMT)
Proposes Mirage-1, a cross-platform, plug-and-play GUI agent built around two key components: a "Hierarchical Multimodal Skills (HMS) module for long-horizon planning" and "A Skill-Augmented Monte Carlo Tree Search (SA-MCTS) algorithm for knowledge exploration in online settings."
The project site is Mirage-1: Augmenting and Updating GUI Agent with Hierarchical Multimodal Skills
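
To make the three-level abstraction (execution skills, core skills, meta-skills) more concrete, the following sketch models it as a simple tree. This is only an illustrative data structure under my own assumptions: the `Skill` class, the `expand` function, the example skill names, and the action strings are all hypothetical and are not taken from the Mirage-1 paper or its code.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Skill:
    name: str
    level: str                      # "execution", "core", or "meta"
    action: Optional[str] = None    # concrete GUI action (execution level only)
    children: list["Skill"] = field(default_factory=list)


def expand(skill: Skill) -> list[str]:
    """Flatten a skill into the ordered list of concrete GUI actions."""
    if skill.level == "execution":
        return [skill.action] if skill.action is not None else []
    steps: list[str] = []
    for child in skill.children:
        steps.extend(expand(child))
    return steps


def build_example() -> Skill:
    """Hypothetical meta-skill: share a photo with a contact."""
    open_photo = Skill("open a photo", "core", children=[
        Skill("open gallery", "execution", action="tap(gallery_icon)"),
        Skill("pick photo", "execution", action="tap(first_photo)"),
    ])
    send_photo = Skill("send to contact", "core", children=[
        Skill("open share sheet", "execution", action="tap(share_button)"),
        Skill("choose contact", "execution", action="tap(contact)"),
    ])
    return Skill("share a photo", "meta", children=[open_photo, send_photo])
```

A planner working at the meta-skill level can reason about "share a photo" as a single step, while `expand(build_example())` recovers the underlying sequence of low-level actions when it is time to execute.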

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning [43.7]
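FinChain is the first symbolic benchmark for verifiable Chain-of-Thought (CoT) financial reasoning. FinChain provides five parameterized templates per topic. Benchmarking 30 LLMs on the dataset shows that even state-of-the-art models have considerable room for improvement.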
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 06:44:42 GMT)
A CoT benchmark for the financial domain. The paper also proposes a framework for evaluating the reasoning process: "We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning."
The repository is GitHub – mbzuai-nlp/finchain: A symbolic benchmark for verifiable chain-of-thought financial reasoning. Includes executable templates, 54 topics across 12 domains, and ChainEval metrics.
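
The idea of a parameterized, executable template with a verifiable reasoning chain can be sketched as follows. This is a toy example under my own assumptions, not code from the FinChain repository: the compound-interest topic, the parameter ranges, and the names `Step`, `compound_interest_template`, and `check_chain` are hypothetical, and `check_chain` is a plain numeric comparison rather than the ChainEval metric.

```python
import math
import random
from dataclasses import dataclass


@dataclass
class Step:
    description: str
    value: float


def compound_interest_template(seed: int) -> tuple[str, list[Step]]:
    """Sample parameters, then build a question and its reference chain."""
    rng = random.Random(seed)
    principal = rng.randrange(1000, 10001, 500)   # deposit amount
    rate_pct = rng.randrange(1, 11)               # annual rate in percent
    years = rng.randrange(1, 6)                   # holding period

    question = (
        f"You deposit {principal} at {rate_pct}% annual interest, "
        f"compounded yearly. What is the balance after {years} years?"
    )
    growth = 1 + rate_pct / 100
    factor = growth ** years
    steps = [
        Step("annual growth factor", growth),
        Step("growth factor over the full period", factor),
        Step("final balance", principal * factor),
    ]
    return question, steps


def check_chain(predicted: list[float], reference: list[Step],
                rel_tol: float = 1e-6) -> bool:
    """Check every intermediate value, not only the final answer."""
    if len(predicted) != len(reference):
        return False
    return all(
        math.isclose(p, step.value, rel_tol=rel_tol)
        for p, step in zip(predicted, reference)
    )
```

Because the reference chain is computed by the template itself, each seed yields a fresh problem instance whose intermediate steps can be checked automatically, which is the property that makes this style of benchmark verifiable.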
