self-X – Introducing the Latest arXiv Papers

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning

Refine-n-Judge: Curating High-Quality Preference Chains for LLM-Fine-Tuning [14.3]
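Large language models (LLMs) have made remarkable progress through preference-based fine-tuning. This paper introduces Refine-n-Judge, an automated iterative approach that uses a single LLM as both refiner and judge to improve dataset quality. The authors demonstrate the effectiveness of Refine-n-Judge on public datasets spanning five corpora.
Reference translation of the paper (metadata) (Sun, 03 Aug 2025 01:56:03 GMT)
"Bringing these capabilities together, we propose Refine-n-Judge, a fully automated dataset curation pipeline, summarized in Figure 2. In this framework, an LLM model serves as both the refiner - generating improved outputs - and the judge - comparing the refined output against the original and selecting the preferred version." This is the paper's proposal for a quality-improvement framework.
It is interesting that the results were not sufficiently good without the judge step. Perhaps the key point is having the LLM solve judging as a task separate from refinement.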
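The refine-then-judge loop quoted above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: `refine` and `judge` are hypothetical callables standing in for LLM calls, `max_rounds` is an assumed stopping parameter, and turning consecutive chain entries into (chosen, rejected) pairs is an assumption about how a "preference chain" might be consumed.

```python
from typing import Callable, List, Tuple


def refine_n_judge(
    prompt: str,
    answer: str,
    refine: Callable[[str, str], str],
    judge: Callable[[str, str, str], bool],
    max_rounds: int = 5,
) -> List[str]:
    """Return the chain of accepted answers, earliest (weakest) first."""
    chain = [answer]
    for _ in range(max_rounds):
        candidate = refine(prompt, chain[-1])
        # Stop as soon as the judge does not prefer the refinement.
        if not judge(prompt, chain[-1], candidate):
            break
        chain.append(candidate)
    return chain


def preference_pairs(chain: List[str]) -> List[Tuple[str, str]]:
    """(chosen, rejected) pairs built from consecutive chain entries."""
    return [(chain[i + 1], chain[i]) for i in range(len(chain) - 1)]


# Toy stand-ins for the LLM calls (hypothetical): the refiner appends
# detail, and the judge prefers longer answers up to 30 characters.
def toy_refine(prompt: str, answer: str) -> str:
    return answer + " +detail"


def toy_judge(prompt: str, old: str, new: str) -> bool:
    return len(old) < len(new) <= 30


chain = refine_n_judge("Explain X", "short answer", toy_refine, toy_judge)
pairs = preference_pairs(chain)
```

Without the `judge` check the loop would accept every refinement, which is one way to picture why the judge step matters in the comment above.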

Self-Questioning Language Models

Self-Questioning Language Models [51.8]
This paper proposes an asymmetric self-play framework in which a proposer is given a topic and generates questions for a solver. Both the proposer and the solver are trained through reinforcement learning. The benchmarks studied are three-digit multiplication, algebra problems from the OMEGA benchmark, and programming problems from Codeforces.
Reference translation of the paper (metadata) (Tue, 05 Aug 2025 17:51:33 GMT)
"Our method leverages the intrinsic capabilities of large language models by casting them in dual roles of proposer and solver within an asymmetric self-play setup. By rewarding the generation of problems that are neither too easy nor too difficult, and by reinforcing answers via internal agreement or external verification, we demonstrate that models can meaningfully improve their reasoning skills through interaction with self-generated content alone." This is the proposed framework. It also feels close to R-Zero: Self-Evolving Reasoning LLM from Zero Data, covered elsewhere on this blog.
The project site is Self-Questioning Language Models
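One way to read "neither too easy nor too difficult" together with "internal agreement" is a majority-vote reward, sketched below. The function name, the `min_agreement` threshold, and the 0/1 reward values are all assumptions made for illustration; the paper's actual reward may differ.

```python
from collections import Counter
from typing import List, Tuple


def self_play_rewards(
    answers: List[str], min_agreement: float = 0.5
) -> Tuple[float, List[float]]:
    """Score one proposed question from several sampled solver answers.

    The majority answer acts as a pseudo-label (internal agreement).
    The proposer is rewarded only when the solver is neither unanimous
    (too easy) nor below min_agreement (too hard); this threshold is a
    hypothetical choice for this sketch.
    """
    majority, votes = Counter(answers).most_common(1)[0]
    agreement = votes / len(answers)
    proposer_reward = 1.0 if min_agreement <= agreement < 1.0 else 0.0
    solver_rewards = [1.0 if a == majority else 0.0 for a in answers]
    return proposer_reward, solver_rewards


mixed = self_play_rewards(["56", "56", "56", "54"])
too_easy = self_play_rewards(["56", "56", "56", "56"])
too_hard = self_play_rewards(["1", "2", "3", "4"])
```

For tasks with an external checker, such as unit tests for Codeforces-style problems, the majority vote would be replaced by the verifier's result.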

R-Zero: Self-Evolving Reasoning LLM from Zero Data

R-Zero: Self-Evolving Reasoning LLM from Zero Data [56.7]
Self-evolving large language models (LLMs) offer a scalable path toward superintelligence by autonomously generating, refining, and learning from their own experiences. Existing methods for training such models still rely heavily on vast numbers of human-curated tasks and labels. R-Zero is a fully autonomous framework that generates its own training data from scratch.
Reference translation of the paper (metadata) (Thu, 07 Aug 2025 03:38:16 GMT)
"we propose R-Zero, a framework for training reasoning LLMs that can self-evolve from zero external data. In R-Zero, a single base model is initialized with two roles – a Challenger and a Solver that are independently optimized but co-evolve throughout the RL process." and "Challenger is rewarded for proposing tasks near the edge of the Solver’s capability, and the Solver is rewarded for solving increasingly challenging tasks posed by the Challenger." A GAN-like framework.
The repository is Chengsong-Huang/R-Zero: codes for R-Zero: Self-Evolving Reasoning LLM from Zero Data (https://www.arxiv.org/pdf/2508.05004)
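A reward for "tasks near the edge of the Solver's capability" could, for example, peak when the Solver succeeds about half the time. The shaping below is one assumed form chosen for illustration, and both function names are hypothetical; the repository should be consulted for the actual reward.

```python
from typing import List


def solver_success_rate(outcomes: List[bool]) -> float:
    """Fraction of sampled Solver attempts that were judged correct."""
    return sum(outcomes) / len(outcomes)


def challenger_reward(success_rate: float) -> float:
    """Assumed shaping: 1.0 at a 50% success rate, 0.0 at 0% or 100%."""
    return 1.0 - 2.0 * abs(success_rate - 0.5)


rate = solver_success_rate([True, True, True, False])
reward = challenger_reward(rate)
```

Tasks the Solver always or never solves earn the Challenger nothing under this shaping, which pushes task difficulty to track the Solver as it improves.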

AlphaGo Moment for Model Architecture Discovery

AlphaGo Moment for Model Architecture Discovery [26.3]
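The paper introduces ASI-Arch, which it presents as the first demonstration of artificial superintelligence for AI research. ASI-Arch is a fully autonomous system that breaks through existing constraints by enabling AI to carry out its own architectural innovation. The system ran 1,773 autonomous experiments over 20,000 GPU hours and discovered 106 innovative, state-of-the-art (SOTA) linear attention architectures.
Reference translation of the paper (metadata) (Thu, 24 Jul 2025 03:57:27 GMT)
An interesting paper that puts ASI in its title. It claims that "ASI-ARCH conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures."
- I am curious how it differs from Language Modeling by Language Models (covered elsewhere on this blog), and about results at more practical, larger scales of parameters, data, and compute.
- I wonder whether it will eventually be able to handle combined efficiency improvements like the recent result listed below.
The repositories are GAIR-NLP/ASI-Arch: AlphaGo Moment for Model Architecture Discovery. and Neural Network Research Data Gallery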

Scaling Linear Attention with Sparse State Expansion [58.2]
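Transformer architectures struggle in long-context scenarios because of quadratic computation and linear memory growth. This paper conceptualizes state updating as information classification and proposes a row-sparse update formulation for linear attention. It then presents Sparse State Expansion (SSE) within the sparse framework, which expands the contextual state into multiple partitions.
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 13:27:31 GMT)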

Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance [39.6]
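Large language model (LLM) agents often struggle in environments where rules and required domain knowledge change frequently. The paper proposes the Adaptive Reflective Interactive Agent (ARIA), which continuously learns updated domain knowledge at test time. ARIA has been deployed within TikTok Pay, which serves over 150 million monthly active users.
Reference translation of the paper (metadata) (Wed, 23 Jul 2025 02:12:32 GMT)
"ARIA addresses conventional model limitations in dynamic environments by assessing uncertainty via self-dialogue, soliciting expert corrections, and updating a timestamped, conflict-resolving knowledge base." A proposal for a framework that self-improves through memory. It is impressive that it is actually deployed.
The repository is yf-he/aria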
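The phrase "timestamped, conflict-resolving knowledge base" can be pictured with a small sketch in which the newest correction for a given key wins. The `Fact` and `KnowledgeBase` classes, the key/value layout, and the newest-wins policy are all assumptions for illustration and are not taken from the ARIA repository.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Fact:
    key: str        # hypothetical rule identifier
    value: str      # the rule text supplied by an expert
    timestamp: int  # e.g. Unix seconds when the correction was given


class KnowledgeBase:
    """Keeps one fact per key; conflicts are resolved by recency."""

    def __init__(self) -> None:
        self._facts: Dict[str, Fact] = {}

    def update(self, fact: Fact) -> bool:
        """Store the fact unless an equally new or newer one exists."""
        current = self._facts.get(fact.key)
        if current is not None and current.timestamp >= fact.timestamp:
            return False
        self._facts[fact.key] = fact
        return True

    def lookup(self, key: str) -> Optional[str]:
        fact = self._facts.get(key)
        return fact.value if fact is not None else None


kb = KnowledgeBase()
first = kb.update(Fact("refund_window", "7 days", timestamp=100))
newer = kb.update(Fact("refund_window", "14 days", timestamp=200))
stale = kb.update(Fact("refund_window", "3 days", timestamp=150))
```

In a real agent the conflict check would likely need semantic matching rather than exact keys; the exact-key dictionary here only keeps the sketch short.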

Probing for Arithmetic Errors in Language Models

Probing for Arithmetic Errors in Language Models [86.8]
Internal activations of language models can be used to detect arithmetic errors. The paper shows that simple probes can accurately decode both the model's predicted output and the correct answer from hidden states. It then trains lightweight error detectors that predict model correctness with over 90% accuracy.
Reference translation of the paper (metadata) (Wed, 16 Jul 2025 16:27:50 GMT)
"Starting with a controlled setting of 3-digit addition, we show that simple probes can accurately decode both the model’s predicted output and the correct answer from hidden states, regardless of whether the model’s output is correct." is something one would expect to be possible. The more interesting results are "We then extend this analysis to a more complex setting, where the model is asked to solve math word problems only requiring addition (Cobbe et al., 2021) using a structured chain-of-thought (CoT) format (Wei et al., 2022), in which intermediate steps are expressed as equations (e.g., <a+b=c>). Remarkably, we find that the same probes trained on simple arithmetic queries can be applied directly to this setting, maintaining over 80% accuracy in detecting whether the model is producing correct intermediate results." and the fact that the probes help with self-correction.
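As a rough picture of what a "simple probe" is, the sketch below fits a linear classifier on vectors to predict a binary correct/incorrect label. Everything here is an assumption for illustration: the "hidden states" are synthetic random vectors rather than real model activations, the label rule is made up, and a perceptron stands in for whatever probe the paper actually trained.

```python
import random
from typing import List, Tuple

Vector = List[float]


def score(weights: Vector, x: Vector) -> float:
    """Linear score; the last weight is the bias term."""
    return sum(w * xi for w, xi in zip(weights, x)) + weights[-1]


def train_probe(
    states: List[Vector], labels: List[int], max_epochs: int = 500
) -> Vector:
    """Perceptron probe; labels are +1 (correct) or -1 (error)."""
    weights = [0.0] * (len(states[0]) + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(states, labels):
            if y * score(weights, x) <= 0:
                for i, xi in enumerate(x):
                    weights[i] += y * xi
                weights[-1] += y
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return weights


def accuracy(weights: Vector, states: List[Vector], labels: List[int]) -> float:
    hits = sum((1 if score(weights, x) > 0 else -1) == y
               for x, y in zip(states, labels))
    return hits / len(labels)


def make_synthetic_states(n: int, seed: int = 0) -> Tuple[List[Vector], List[int]]:
    """Synthetic stand-in: the label depends linearly on two coordinates."""
    rng = random.Random(seed)
    states: List[Vector] = []
    labels: List[int] = []
    while len(states) < n:
        x = [rng.uniform(-1.0, 1.0) for _ in range(4)]
        signal = x[0] + x[1]
        if abs(signal) < 0.2:  # keep a margin so the data is separable
            continue
        states.append(x)
        labels.append(1 if signal > 0 else -1)
    return states, labels


states, labels = make_synthetic_states(200)
weights = train_probe(states, labels)
train_accuracy = accuracy(weights, states, labels)
```

With real activations the interesting question is the one the quote raises: whether a probe fitted on plain addition queries still works when applied to intermediate chain-of-thought steps.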

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning [27.2]
SPIRAL is a self-play framework in which models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves. With SPIRAL, self-play on zero-sum games produces reasoning capabilities that transfer broadly. Analysis reveals that this transfer occurs through three cognitive patterns: systematic decomposition, expected value calculation, and case-by-case analysis.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 17:58:13 GMT)
To reduce dependence on humans, the paper proposes the framework described as "We introduce SPIRAL, a self-play framework where models learn by playing multi-turn, zero-sum games against continuously improving versions of themselves, eliminating the need for human supervision." and reports that it is effective. It states "Key Findings. Training on zero-sum games produces reasoning capabilities that transfer broadly." It is somewhat surprising that it outperforms SFT: "Our empirical results show that training on Kuhn Poker alone improves mathematical reasoning by 8.7% average and Minerva Math by 18.1%, surpassing models trained on 25,000 expert demonstrations"
The repository is GitHub – spiral-rl/spiral: SPIRAL: Self-Play on Zero-Sum Games Incentivizes Reasoning via Multi-Agent Multi-Turn Reinforcement Learning

AlphaEvolve: A coding agent for scientific and algorithmic discovery

AlphaEvolve: A coding agent for scientific and algorithmic discovery [63.1]
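The paper presents AlphaEvolve, an evolutionary coding agent that substantially enhances the capabilities of state-of-the-art LLMs. AlphaEvolve orchestrates an autonomous pipeline of LLMs whose task is to improve an algorithm by making direct changes to the code. The authors demonstrate the broad applicability of this approach by applying it to a number of important computational problems.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 06:37:18 GMT)
The paper behind AlphaEvolve: A Gemini-powered coding agent for designing advanced algorithms – Google DeepMind has appeared on arXiv.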

Agents of Change: Self-Evolving LLM Agents for Strategic Planning

Agents of Change: Self-Evolving LLM Agents for Strategic Planning [17.7]
The paper benchmarks a progression of LLM-based agents, from a simple game-playing agent to systems capable of autonomously rewriting their own prompts and the player agent's code. The results show that self-evolving agents, particularly when powered by models such as Claude 3.7 and GPT-4o, outperform static baselines by autonomously adapting their strategies.
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 05:45:24 GMT)
A proposal and evaluation of a Self-Evolving Agent Framework, using Settlers of Catan as the testbed.
The paper reports: "Through extensive experiments, we show that agents capable of prompt and code evolution achieve consistently higher performance than static baselines. The PromptEvolver, in particular, outperforms fixed agents across key metrics, and its gains are amplified when paired with stronger base models, seen in Claude 3.7’s 95% improvement from the BaseAgent". The PromptEvolver includes an "Evolver Agent: Provided with access to game results, evolution history, and tools to search the web, view local files, and edit the Player Agent’s prompt."
Self-improvement outside the weights, which carry the reasoning ability itself, through prompts and code also appears to be quite effective. (If one considers ICL to be effective, could this be said to improve reasoning ability to some extent as well?)

Boosting LLM Reasoning via Spontaneous Self-Correction

Boosting LLM Reasoning via Spontaneous Self-Correction [43.5]
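One approach to improving mathematical reasoning is self-correction. Existing self-correction approaches treat correction as a separate post-generation step. This work proposes SPOC, a spontaneous self-correction approach that enables LLMs to generate interleaved solutions and verifications in a single inference pass.
Reference translation of the paper (metadata) (Sat, 07 Jun 2025 21:23:00 GMT)
"we introduce SPOC, a spontaneous self-correction approach that enables LLMs to spontaneously generate interleaved solutions and verifications in a single inference pass." An approach that brings to mind the relationship between CoT (ToT) and LRMs.
I am curious whether it is better to bundle models strengthened in this way MoA-style, or whether a single model will absorb all of these techniques.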
