July 8, 2025 – Introducing the latest arXiv papers

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.2]
SPIRAL is a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. The analysis shows that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 17:58:13 GMT)
To reduce dependence on humans, the authors propose a framework, "We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision.", and report that it is effective. They state: "Key Findings. Training on zero-sum games produces reasoning capabilities that transfer broadly." It is somewhat surprising that the approach outperforms SFT: "Our empirical results show that training on Kuhn Poker alone improves mathematical reasoning by 8.7% average and Minerva Math by 18.1%, surpassing models trained on 25,000 expert demonstrations".
The repository is GitHub – spiral-rl/spiral: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
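
As a rough illustration of what "zero-sum" and "expected value calculation" mean in Kuhn Poker, here is a minimal sketch of the standard game's payoffs. It is not taken from the SPIRAL repository; the names `payoff` and `expected_value_check_check` and the history encoding are assumptions made for this example.

```python
# Standard Kuhn Poker: three cards, each player antes 1 chip and gets one card.
# Hypothetical encoding of finished hands: c = check, b = bet, f = fold, k = call.
CARDS = {"J": 0, "Q": 1, "K": 2}


def payoff(card1, card2, history):
    """Return (player 1 payoff, player 2 payoff) for a finished hand."""
    p1_wins_showdown = CARDS[card1] > CARDS[card2]
    if history == "cc":              # both check: showdown for the antes
        gain = 1 if p1_wins_showdown else -1
    elif history == "bf":            # player 1 bets, player 2 folds
        gain = 1
    elif history == "cbf":           # check, player 2 bets, player 1 folds
        gain = -1
    elif history in ("bk", "cbk"):   # a bet is called: showdown for ante + bet
        gain = 2 if p1_wins_showdown else -2
    else:
        raise ValueError(f"not a terminal history: {history!r}")
    return gain, -gain               # zero-sum: one player's gain is the other's loss


def expected_value_check_check(card1):
    """Player 1's expected payoff if both players always check."""
    others = [c for c in CARDS if c != card1]  # opponent holds either remaining card
    return sum(payoff(card1, c2, "cc")[0] for c2 in others) / len(others)


for card in CARDS:
    print(card, expected_value_check_check(card))
```

Because the two payoffs always sum to zero, any improvement by one copy of the model directly makes the game harder for the other copy, which is the property self-play relies on.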

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities

Graphs Meet AI Agents: Taxonomy, Progress, and Future Opportunities [117.5]
Data structuring plays a promising role by transforming complex, unorganized data into well-structured forms. This survey presents the first systematic review of how graphs can empower AI agents.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 12:59:12 GMT)
A survey on graphs and agents.
The repository is GitHub – YuanchenBei/Awesome-Graphs-Meet-Agents: A curated list of resources on graph-empowered agents and agent-facilitated graph learning (Graphs Meet Agents).

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.7]
Many agentic benchmarks have issues in their task setup or reward design. Such issues can lead to under- or overestimation of agents' performance by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines synthesized from our benchmark-building experience.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 17:35:31 GMT)
A paper summarizing the pitfalls of agentic benchmarks, which are difficult to build.
Problems such as "the issues found in τ-bench-Airline, some other example issues we found are: (1) an agent can score 100% on SWE-Lancer without resolving any tasks;" seem to be fairly common, and "Based on ABC, we assessed ten widely used agentic benchmarks and identified significant evaluation issues that cases up to 100% errors (in relative terms) when estimating agents’ performance." does not come as a shock either.
The repository is GitHub – uiuc-kang-lab/agentic-benchmarks
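
To make "100% errors (in relative terms)" concrete, here is a small sketch using one common definition of relative error. The paper's exact definition is not quoted in this post, so the formula is an assumption, and the numbers are hypothetical, chosen only for illustration.

```python
def relative_error(reported, true_value):
    """Relative error of a reported score against the true score (assumed definition)."""
    if true_value == 0:
        raise ValueError("relative error is undefined when the true score is 0")
    return abs(reported - true_value) / abs(true_value)


# Hypothetical numbers, not taken from the paper: a benchmark reports a 50%
# success rate for an agent that actually solves 25% of the tasks.
true_success_rate = 0.25
reported_success_rate = 0.50
print(f"{relative_error(reported_success_rate, true_success_rate):.0%}")  # prints 100%
```

In other words, under this definition a 100% relative error means the reported score is off by as much as the true score itself.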

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real

MultiGen: Using Multimodal Generation in Simulation to Learn Multimodal Policies in Real [128.8]
MultiGen is a framework that integrates large-scale generative models into traditional physics simulators. It demonstrates effective zero-shot transfer to real-world pouring with containers and liquids.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 17:59:58 GMT)
The paper proposes the following framework: "In this work, we introduced MULTIGEN, a novel framework for integrating generative multimodal simulation into robot learning. By augmenting physics-based simulators with large-scale generative models, we demonstrated that sim-to-real policy learning can leverage rich sensory feedback beyond vision and proprioception."
The combined use of synthesized audio data is the interesting part.
