2025年6月16日 – arXiv最新論文の紹介

Spurious Rewards: Rethinking Training Signals in RLVR

Spurious Rewards: Rethinking Training Signals in RLVR [130.3]
検証可能な報酬(RLVR)を用いた強化学習は,特定のモデルにおいて強い数学的推論を導出できることを示す。例えば、RLVRはQwen2.5-Math-7BのMATH-500の性能を21.4%向上させた。コード推論 — 実際のコード実行なしにコードで考える — は、RLVR以降、はるかに頻繁になる、独特なQwen2.5-Mathの振る舞いである。
論文参考訳（メタデータ） (Thu, 12 Jun 2025 17:49:55 GMT)
「We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in abso- lute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting)—nearly matching the 29.1% gained with ground truth rewards.」という直観に反する結果の報告と検証。
「Our findings have three main implications: base model pretraining significantly affects RLVR outcomes; even corrupted or spurious supervision can enhance reasoning when it triggers useful existing behaviors; and effects observed in one model family may not generalize to others. Our work highlights the importance of (1) testing across multiple models with differing pretraining distributions, and (2) testing across multiple different baselines, such as format and random rewards, when evaluating reinforcement learning techniques.」としている。モデルに依存し、結果が間違っていても一定効果があるのは本当に面白い。内部知識とそれを引き出すテクニックの間にはいまだギャップがあるということだろうか。。
リポジトリはGitHub – ruixin31/Spurious_Rewards、https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f?pvs=4にBlog記事もある。

Self-Adapting Language Models

Self-Adapting Language Models [44.5]
大規模言語モデル(LLM)は強力だが静的であり、新しいタスクや知識、例に対応して重みを適応するメカニズムが欠如している。我々は,自己適応型LSM(Self-Adapting LLMs, SEAL)を導入する。知識の定式化と数ショットの一般化の実験により、SEALは自己指向適応が可能な言語モデルに向けた有望なステップであることが示された。
論文参考訳（メタデータ） (Thu, 12 Jun 2025 17:48:13 GMT)
「We propose Self-Adapting LLMs (SEAL), a framework that enables language models to improve themselves by generating their own synthetic data and optimization parameters (“self-edits”) in re- sponse to new data. The model is trained to produce these self-edits directly through token generation with the data provided in the model’s context. Self-edit generation is learned via reinforcement learning (RL) where the model is rewarded for generating self-edits (SE) that, when applied, improve the model’s performance at the target task.」という自己適合、自己進化、自己改善のアプローチ。SQuADやARC-AGI benchmark（のサブセット）を用いて効果を検証している。
合成データを介しての自己改善はやはり有効そうという印象。（今でも一定実用的であると思うが）AGIとかいう世界観を考えると時間的制約が解消できるかがポイントだろうか。（AIにも睡眠が必要と言いつつこの手の処理を行うような少し未来が妄想される）
プロジェクトサイトはSelf-Adapting Language Models

Self-Adapting Improvement Loops for Robotic Learning [30.8]
専門家によるデモンストレーションで訓練されたビデオ生成モデルは、ロボットタスクを解くためのパフォーマンスの高いテキスト条件付きビジュアルプランナーとして利用されてきた。本研究では,自己生成トラジェクトリ上で,ドメイン内ビデオモデルを反復的に更新する自己改善ループ(SAIL)を提案する。従来のドメイン内ビデオモデルトレーニングでは,新規タスクの繰り返しに対して,パフォーマンスが継続的に向上することが確認できた。
論文参考訳（メタデータ） (Sat, 07 Jun 2025 04:34:37 GMT)
「we highlight that adaptation with large-scale pretrained text-conditioned video models is critical for facilitating self-improvement, by contributing text-conditioned generalization capabilities and motion priors.」とこちらは動画生成モデルを活用するアプローチ。
プロジェクトサイトはSAIL

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity

The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [16.3]
大規模な推論モデルは、回答を提供する前に詳細な思考プロセスを生成する。我々は, LRM がある種の複雑さを超えて完全に精度の低下に直面していることを示す。また、より深く推論の痕跡を調べ、探索された解のパターンを研究する。
論文参考訳（メタデータ） (Sat, 07 Jun 2025 22:42:29 GMT)
LRMに対する分析。「Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counter- intuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.」とのこと。
面白い検証結果。とはいえ、このような劣化はLLMの計算能力などでも指摘されてきた印象がある。直観的には現状のLLM/LRMはメタな解放に行きつけないという印象を持つが、コード生成などツール活用すれば多分解けるレベルであろうし解釈は悩ましいところ。
「We identified three distinct reasoning regimes: standard LLMs outperform LRMs at low complexity, LRMs excel at moderate complexity, and both collapse at high complexity.」は今の感覚としてはそうだろうと思う。
賛否はあるだろうが、下記のようにAnthropicのC. Opusから反論が来ているのが面白い。

Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [0.0]
大規模推論モデル(LRM)は、特定の複雑性しきい値を超えた計画パズルについて「精度の崩壊」を示す。これらの結果は,基本的推論失敗ではなく,実験的な設計上の制約を主に反映していることが実証された。
論文参考訳（メタデータ） (Tue, 10 Jun 2025 21:16:53 GMT)
1st authorがAnthropicのC. Opus、Acknowledgmentsに「We thank Ryan Greenblatt, o3, Gemini 2.5, and all of the people who pointed out the parentheses mismatch in an earlier draft for helpful comments」と書かれている。

Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks

Exploring Consciousness in LLMs: A Systematic Survey of Theories, Implementations, and Frontier Risks [46.9]
意識は人間の心の最も深い特徴の1つである。大規模言語モデル(LLM)が前例のないペースで発展するにつれ、知性と意識に関する疑問がますます重要になっている。
論文参考訳（メタデータ） (Mon, 26 May 2025 10:40:52 GMT)
「we systematically organize and synthesize existing research on LLM consciousness from both theoretical and empirical perspectives. Furthermore, we highlight potential frontier risks that conscious LLMs might introduce.」と意識に関するサーベイ。
リポジトリがあり、論文リストが参考になる　GitHub – OpenCausaLab/Awesome-LLM-Consciousness

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge

LLMs Cannot Reliably Judge (Yet?): A Comprehensive Assessment on the Robustness of LLM-as-a-Judge [44.6]
大規模言語モデル(LLM)は、様々なタスクにまたがる顕著な知性を示してきた。これらのシステムは、評価結果を操作できる敵攻撃の影響を受けやすい。 LLMに基づく審査員による既存の評価手法は、しばしば断片的であり、包括的な評価のための統一された枠組みが欠如している。
論文参考訳（メタデータ） (Wed, 11 Jun 2025 06:48:57 GMT)
「This work presents the first scalable and fully automated framework to evaluate the robustness and reliability of LLM-as-a-Judge systems across multiple attack scenarios. We systematically benchmarked state-of-the-art LLM-based evaluators under various adversarial settings and found that they are vulnerable to manipulation, often producing biased or incorrect judgments when exposed to crafted inputs.」とのこと。LLM-as-a-Judgeシステムの堅牢性を体系的に評価するために設計されたRobustJudgeというフレームワークで評価を行っている。
リポジトリはGitHub – S3IC-Lab/RobustJudge

Magistral

Magistral [101.5]
私たちは、Mistralの最初の推論モデルであるMagistralと、当社独自のスケーラブルな強化学習パイプラインを紹介します。テキストデータだけでRLが初期チェックポイントの能力のほとんどを維持していることを示す。我々は、Mistral Medium 3上でRL単独で推論するために訓練されたMagistral Mediumを紹介し、Magistral Small(Apache 2.0)をオープンソース化した。
論文参考訳（メタデータ） (Thu, 12 Jun 2025 17:22:37 GMT)
MistralのLRM、「Eating the multimodal free lunch」は面白い。
24BのモデルはApache2ライセンスで公開されている。mistralai/Magistral-Small-2506 · Hugging Face

2025年6月
月	火	水	木	金	土	日
						1
2	3	4	5	6	7	8
9	10	11	12	13	14	15
16	17	18	19	20	21	22
23	24	25	26	27	28	29
30