self-X – Page 3 – Introducing the Latest arXiv Papers

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.2]
SPIRAL is a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Analysis shows that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 17:58:13 GMT)
To reduce dependence on humans, the paper proposes a framework, described as "We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision.", and reports that its effect was confirmed. It states "Key Findings. Training on zero-sum games produces reasoning capabilities that transfer broadly." It is somewhat surprising that this outperforms SFT: "Our empirical results show that training on Kuhn Poker alone improves mathematical reasoning by 8.7% average and Minerva Math by 18.1%, surpassing models trained on 25,000 expert demonstrations".
The repository is GitHub – spiral-rl/spiral: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning
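To make "self-play on a zero-sum game" concrete, here is a minimal sketch of Kuhn Poker, the game named in the quote above, with one policy playing both seats. It uses the standard textbook rules of Kuhn Poker and is not SPIRAL's training code. The names `payoff`, `legal_actions`, `self_play_episode`, and `random_policy` are illustrative assumptions, and the random policy merely stands in for the language model.

```python
import random

CARDS = ["J", "Q", "K"]  # ranked low to high
TERMINAL = {"cc", "bc", "bf", "cbc", "cbf"}  # c = check/call, b = bet, f = fold


def payoff(card0, card1, history):
    """Net winnings (player 0, player 1) for a finished hand; each player antes 1."""
    showdown = 1 if CARDS.index(card0) > CARDS.index(card1) else -1
    if history == "cc":            # both check: showdown for the antes
        p0 = showdown
    elif history in ("bc", "cbc"): # a bet was called: showdown for ante + bet
        p0 = 2 * showdown
    elif history == "bf":          # player 1 folds to a bet
        p0 = 1
    elif history == "cbf":         # player 0 folds to a bet
        p0 = -1
    else:
        raise ValueError(f"not a terminal history: {history!r}")
    return p0, -p0                 # zero-sum: one player's gain is the other's loss


def legal_actions(history):
    """Facing a bet, a player may call or fold; otherwise check or bet."""
    return ["c", "f"] if history.endswith("b") else ["c", "b"]


def self_play_episode(policy, rng):
    """Play one hand where the same policy acts for both players."""
    card0, card1 = rng.sample(CARDS, 2)
    history = ""
    while history not in TERMINAL:
        card = card0 if len(history) % 2 == 0 else card1
        history += policy(card, history, legal_actions(history), rng)
    return history, payoff(card0, card1, history)


def random_policy(card, history, actions, rng):
    """Placeholder for the model: picks a legal action uniformly at random."""
    return rng.choice(actions)
```

Because the two rewards always sum to zero, any improvement by the policy in one seat makes the opponent it faces in the other seat harder, which is the property self-play relies on.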

AlphaEvolve: A coding agent for scientific and algorithmic discovery

AlphaEvolve: A coding agent for scientific and algorithmic discovery [63.1]
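We present AlphaEvolve, an evolutionary coding agent that substantially enhances the capabilities of state-of-the-art LLMs. AlphaEvolve orchestrates an autonomous pipeline of LLMs whose task is to improve an algorithm by making direct changes to the code. We demonstrate the broad applicability of this approach by applying it to a number of important computational problems.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 06:37:18 GMT)
The paper behind AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms – Google DeepMind has appeared on arXiv.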

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Agents of Change: Self-Evolving LLM Agents for Strategic Planning [17.7]
We benchmark a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and the player agent's code. The results show that self-evolving agents, particularly when powered by models such as Claude 3.7 and GPT-4o, outperform static baselines by autonomously adapting their strategies.
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 05:45:24 GMT)
Proposes and validates a Self-Evolving Agent Framework, using Settlers of Catan as the target game.
The paper states: "Through extensive experiments, we show that agents capable of prompt and code evolution achieve consistently higher performance than static baselines. The PromptEvolver, in particular, outperforms fixed agents across key metrics, and its gains are amplified when paired with stronger base models, seen in Claude 3.7’s 95% improvement from the BaseAgent". PromptEvolver includes an "Evolver Agent: Provided with access to game results, evolution history, and tools to search the web, view local files, and edit the Player Agent’s prompt."
Self-improvement outside the weights (where the reasoning ability itself resides), through prompts and code, also appears to be quite effective. (If we consider ICL to be effective, could we say this improves reasoning ability to some degree...?)
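The prompt-evolution loop described in the quote can be sketched roughly as follows. This is only an illustration under assumptions: `evolve_prompt`, `toy_play_game`, and `toy_evolver` are hypothetical names, the toy scoring rule is made up, and the real Evolver Agent is an LLM with web-search and file tools rather than a fixed rule.

```python
def evolve_prompt(initial_prompt, play_game, evolver, generations):
    """Alternate between playing with the current prompt and letting the evolver edit it."""
    prompt = initial_prompt
    history = []  # evolution history: (prompt, game result) pairs
    for _ in range(generations):
        result = play_game(prompt)          # game result for the Player Agent's prompt
        history.append((prompt, result))
        prompt = evolver(prompt, history)   # evolver sees results and history, returns a new prompt
    return prompt, history


HINTS = ("build roads", "trade early")  # hypothetical strategy hints


def toy_play_game(prompt):
    """Hypothetical stand-in for a game: score is the number of hints in the prompt."""
    return sum(hint in prompt for hint in HINTS)


def toy_evolver(prompt, history):
    """Hypothetical stand-in for the Evolver Agent: adds the first missing hint."""
    for hint in HINTS:
        if hint not in prompt:
            return prompt + " " + hint + "."
    return prompt
```

Calling `evolve_prompt("Play Catan.", toy_play_game, toy_evolver, 3)` runs three generations. The model weights never change; only the prompt text does.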

Boosting LLM Reasoning via Spontaneous Self-Correction

Boosting LLM Reasoning via Spontaneous Self-Correction [43.5]
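One approach to improving mathematical reasoning is self-correction. Existing self-correction approaches treat correction as a separate post-generation step. This work proposes SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
Reference translation of the paper (metadata) (Sat, 07 Jun 2025 21:23:00 GMT)
"we introduce SPOC, a spontaneous self-correction approach that enables LLMs to spontaneously generate interleaved solutions and verifications in a single inference pass." This approach brings to mind the relationship between CoT (ToT) and LRMs.
I am curious whether it is better to bundle models enhanced in this way MoA-style, or whether a single model will end up absorbing all of these techniques.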

Self-Adapting Language Models

Self-Adapting Language Models [44.5]
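Large language models (LLMs) are powerful but static, lacking mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL). Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation.
Reference translation of the paper (metadata) (Thu, 12 Jun 2025 17:48:13 GMT)
"We propose Self-Adapting LLMs (SEAL), a framework that enables language models to improve themselves by generating their own synthetic data and optimization parameters (“self-edits”) in response to new data. The model is trained to produce these self-edits directly through token generation with the data provided in the model’s context. Self-edit generation is learned via reinforcement learning (RL) where the model is rewarded for generating self-edits (SE) that, when applied, improve the model’s performance at the target task." This is an approach to self-adaptation, self-evolution, and self-improvement. Its effectiveness is verified using SQuAD and (a subset of) the ARC-AGI benchmark.
My impression is that self-improvement via synthetic data does look effective. (I think it is already practical to some degree, but) from an AGI-style perspective, the key point may be whether the time constraints can be resolved. (One can imagine a near future in which we say that AI needs sleep too, and this kind of processing runs during it.)
The project site is Self-Adapting Language Models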
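The reward signal in the quote (a self-edit is rewarded if applying it improves performance on the target task) can be illustrated with a toy sketch. Everything here is a simplifying assumption: the "model" is a dictionary of facts, a self-edit is a dictionary of synthetic facts, and applying an edit is a dictionary update in place of a fine-tuning step. The binary reward and the function names are likewise illustrative, not taken from the paper.

```python
def evaluate(knowledge, questions):
    """Fraction of (question, answer) pairs the toy model answers correctly."""
    correct = sum(1 for q, a in questions if knowledge.get(q) == a)
    return correct / len(questions)


def apply_self_edit(knowledge, self_edit):
    """Return an updated copy of the model; stands in for a weight update."""
    updated = dict(knowledge)
    updated.update(self_edit)
    return updated


def reward_self_edits(knowledge, candidates, questions):
    """Reward 1 for each candidate self-edit that improves the evaluation score, else 0."""
    baseline = evaluate(knowledge, questions)
    rewards = []
    for edit in candidates:
        score = evaluate(apply_self_edit(knowledge, edit), questions)
        rewards.append(1 if score > baseline else 0)
    return rewards
```

In an RL loop, the rewarded self-edits would be the ones the generator is trained to produce more often. Each candidate requires its own update and evaluation, which is where the time cost mentioned above comes from.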

Self-Adapting Improvement Loops for Robotic Learning [30.8]
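Video generative models trained on expert demonstrations have been used as performant text-conditioned visual planners for solving robotic tasks. This work proposes the Self-Adapting Improvement Loop (SAIL), in which an in-domain video model iteratively updates itself on self-produced trajectories. Performance was found to improve continuously over multiple iterations on novel tasks that were not seen during the original in-domain video model training.
Reference translation of the paper (metadata) (Sat, 07 Jun 2025 04:34:37 GMT)
"we highlight that adaptation with large-scale pretrained text-conditioned video models is critical for facilitating self-improvement, by contributing text-conditioned generalization capabilities and motion priors." This one is an approach that leverages video generation models.
The project site is SAIL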

Self-Challenging Language Model Agents

Self-Challenging Language Model Agents [98.6]
This paper proposes the Self-Challenging framework for training an agent on high-quality tasks that the agent generates itself. The framework achieves a two-fold improvement for Llama-3.1-8B-Instruct.
Reference translation of the paper (metadata) (Mon, 02 Jun 2025 14:23:33 GMT)
"we present the Self-Challenging Agent (SCA) method for self-improvement of general multi-turn tool-use LLM agents. SCA can create its own tasks to challenge itself and learn from them. To do this, it utilizes the Code-as-Task (CaT) formulation which ensures high quality synthetic tasks. Through RL on these self-generated synthetic tasks, SCA can be used to train a Llama-3.1-8B model to achieve an average relative success rate improvement of 95.8% on existing test tasks across four different multi-turn tool-use environments." A report that gives the sense of a future moving closer to AGI. (As the paper says, "While SCA serves as a preliminary step, there remains many research questions for building an effective self-improvement flywheel for general LLM agents.", so in practice there are probably still various barriers.)
The effective use of code generation is also interesting. I wonder whether the stage at which tasks expressible in a formal language can be solved will arrive sooner than expected...
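One way a Code-as-Task formulation can keep synthetic tasks high quality is to attach executable checks to each task and discard tasks whose checks are inconsistent. The sketch below assumes a task bundles an instruction, a verification function, an example solution, and known failure cases. That decomposition, the field names, and `is_well_formed` are assumptions made for illustration, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class CodeTask:
    instruction: str                 # natural-language task given to the agent
    verify: Callable[[str], bool]    # code that checks a candidate answer
    example_solution: str            # an answer that should pass verification
    failure_cases: List[str]         # answers that should fail verification


def is_well_formed(task: CodeTask) -> bool:
    """Keep a task only if its solution passes and every failure case is rejected."""
    if not task.verify(task.example_solution):
        return False
    return not any(task.verify(bad) for bad in task.failure_cases)


def filter_tasks(tasks: List[CodeTask]) -> List[CodeTask]:
    """Drop self-generated tasks whose checks are inconsistent."""
    return [task for task in tasks if is_well_formed(task)]
```

A verifier that accepts everything, for example, is filtered out because its failure cases pass. The same `verify` function can then serve as the reward signal during RL.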

Think Only When You Need with Large Hybrid-Reasoning Models

Think Only When You Need with Large Hybrid-Reasoning Models [121.6]
Large Hybrid-Reasoning Models (LHRMs) are models that can adaptively decide whether or not to think based on the contextual information of the user query. Experiments show that LHRMs can adaptively perform hybrid thinking on queries of varying difficulty and type.
Reference translation of the paper (metadata) (Wed, 21 May 2025 05:17:34 GMT)
Proposes a hybrid LLM/LRM method. It is interesting that it inserts a first stage, described as "We begin with a hybrid-formatted supervised fine-tuning stage named Hybrid Fine-Tuning (HFT) that integrates both reasoning-intensive (Thinking) and direct-answer (No-Thinking) data. This approach mitigates the instability often observed in cold-start scenarios [GYZ+25], and establishes a robust initialization for next stage reinforcement learning."
I am somewhat curious whether the abbreviation LHRM will catch on.
The repository is Advancing AI for Humanity

Let LLMs Break Free from Overthinking via Self-Braking Tuning [60.1]
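Large reasoning models (LRMs) have significantly improved their reasoning capabilities by generating longer chains of thought. This performance gain comes at the cost of a substantial increase in redundant reasoning during generation. This paper proposes Self-Braking Tuning (SBT), a novel framework that tackles overthinking from the perspective of allowing the model to regulate its own reasoning process.
Reference translation of the paper (metadata) (Tue, 20 May 2025 16:53:40 GMT)
"we propose a novel endogenous approach, Self-Braking Tuning (SBT), to mitigating overthinking in large language models." This is close in content, in the sense of saving tokens.
The repository is GitHub – ZJU-REAL/Self-Braking-Tuning: Let LLMs Break Free from Overthinking via Self-Braking Tuning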

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.6]
Self-Play Critic (SPC) is a novel approach in which a critic model evolves its ability to assess reasoning steps through adversarial self-play games. SPC fine-tunes two copies of a base model to play two roles: a "sneaky generator" and a "critic".
Reference translation of the paper (metadata) (Sun, 27 Apr 2025 08:45:06 GMT)
The paper proposes the following approach: "In this paper, we propose a self-play critic with the ability of detecting step-level LLMs reasoning errors. Specifically, we design a sneaky generator to produce incorrect steps and a critic to assess the correctness of each step. Through the adversarial game between these two models, we can continuously generate positive and negative samples for reinforcement learning." It feels GAN-like.
The project site is SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning
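The adversarial game in the quote can be sketched as a reward assignment: the sneaky generator submits a step that is assumed to be genuinely incorrect, and whichever side "wins" the round is rewarded. The ±1 reward values and the names `adversarial_rewards` and `split_critiques` are assumptions made for illustration, not the paper's exact scheme.

```python
def adversarial_rewards(critic_flags_error):
    """Rewards for one round, assuming the sneaky step really is incorrect."""
    if critic_flags_error:
        # The critic caught the error: critic wins, generator loses.
        return {"sneaky_generator": -1, "critic": 1}
    # The error slipped through: generator wins, critic loses.
    return {"sneaky_generator": 1, "critic": -1}


def split_critiques(rounds):
    """Split critiques into positive and negative samples for training the critic.

    rounds: list of (critique_text, critic_flags_error) pairs.
    """
    positives, negatives = [], []
    for critique, flagged in rounds:
        if adversarial_rewards(flagged)["critic"] > 0:
            positives.append(critique)
        else:
            negatives.append(critique)
    return positives, negatives
```

As in a GAN, each side's training data comes from the other's current behavior, so samples keep getting harder as both improve.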

Self-Taught Self-Correction for Small Language Models

Self-Taught Self-Correction for Small Language Models [16.5]
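This work explores self-correction in small language models (SLMs) through iterative fine-tuning using solely self-generated data. It introduces the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task show that STaSC effectively learns self-correction, leading to significant performance improvements.
Reference translation of the paper (metadata) (Tue, 11 Mar 2025 17:57:44 GMT)
Proposes Self-Taught Self-Correction (STaSC), which builds various forms of self-correction into STaR.
The repository is GitHub – VityaVitalich/STASC: [ICLR 2025 SSI-FM] Self-Taught Self-Correction for Small Language Models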

START: Self-taught Reasoner with Tools

START: Self-taught Reasoner with Tools [51.4]
We introduce START (Self-Taught Reasoner with Tools), a tool-integrated long chain-of-thought (CoT) reasoning LLM. START can perform complex computations, self-checking, exploration of diverse methods, and self-debugging. It significantly outperforms the base QwQ-32B and achieves performance comparable to the state-of-the-art open-weight model R1-Distill-Qwen-32B.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 17:11:51 GMT)
Proposes START (Self-Taught Reasoner with Tools), which performs tool-integrated CoT. The two key points are "Hint-infer: code/math data is processed by QwQ, with responses truncated at predefined terminators. Context-aware hints from a Hint-Library are injected at truncation points (including endpoints), and QwQ resumes inference using a code interpreter for Python execution feedback." and "b) Hint-RFT: Hint-infer outputs undergo rule-based scoring, filtering, and content modification to create Dseed." My impression is that rules and templates are being integrated skillfully, and there seem to be many possible variations on this kind of technique.
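The truncate-and-inject step of Hint-infer can be sketched with plain string handling. This is a minimal illustration, assuming that terminators are fixed strings and that a hint is a single sentence. The terminator list, the hint text, and the function name `inject_hint` are hypothetical, and the real pipeline resumes model inference with a code interpreter after the hint.

```python
def inject_hint(response, hint, terminators):
    """Truncate a response at the earliest terminator and append a hint.

    If no terminator occurs, the hint is injected at the endpoint of the response.
    """
    positions = [response.find(t) for t in terminators if t in response]
    cut = min(positions) if positions else len(response)
    # Everything from the terminator onward is dropped; generation would resume after the hint.
    return response[:cut].rstrip() + "\n" + hint + "\n"


# Hypothetical terminators and hint, for illustration only.
TERMINATORS = ["Wait", "Alternatively"]
HINT = "Maybe I can use Python to check this."

prompt_for_resume = inject_hint(
    "Compute 2+3 = 6. Wait, let me recheck.", HINT, TERMINATORS
)
```

Here `prompt_for_resume` ends with the hint, so the model's next tokens are nudged toward writing and running code instead of continuing the original train of thought.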
