Reinforcement Learning – Introducing the Latest arXiv Papers

Robust Reward Modeling via Causal Rubrics

Robust Reward Modeling via Causal Rubrics [46.4]
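Reward models (RMs) are fundamental to aligning large language models (LLMs) via human feedback, yet they often suffer from reward hacking. Crome is a novel framework grounded in an explicit causal model and designed to mitigate reward hacking. It significantly outperforms standard baselines on RewardBench, improving average accuracy by up to 5.4% and achieving gains of up to 13.2% and 7.2% in specific categories.
Paper reference translation (metadata) (Thu, 19 Jun 2025 17:59:47 GMT)
Proposes Crome (Causally Robust Reward Modeling), a framework that uses causality to counter reward hacking.
This is work from Google DeepMind, though the name is awfully easy to confuse with Chrome...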

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy

AceReason-Nemotron 1.1: Advancing Math and Code Reasoning through SFT and RL Synergy [48.3]
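We investigate the synergy between supervised fine-tuning (SFT) and reinforcement learning (RL) in developing strong reasoning models. The scaling strategies yield notable improvements in reasoning performance. Our AceReason-Nemotron-1.1 7B model significantly outperforms AceReason-Nemotron-1.0 and achieves new state-of-the-art performance among Qwen2.5-7B-based reasoning models.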
Paper reference translation (metadata) (Mon, 16 Jun 2025 09:27:48 GMT)
A paper examining the relationship between SFT and RL, which matters for LRM development. The authors report: "Our results show that both scaling strategies substantially improve the reasoning abilities of large language models (LLMs)."
"Interestingly, even strong SFT models with robust coding abilities benefit substantially from math-only RL training. This leads to further gains in coding performance." This kind of performance gain in an adjacent (?) domain shows up in many places in this field, and I find it an interesting property.
The repository is nvidia/AceReason-Nemotron-1.1-7B · Hugging Face

Spurious Rewards: Rethinking Training Signals in RLVR

Spurious Rewards: Rethinking Training Signals in RLVR [130.3]
We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B by 21.4% in absolute points. Code reasoning (thinking in code without actual code execution) is a distinctive Qwen2.5-Math behavior that becomes far more frequent after RLVR.
Paper reference translation (metadata) (Thu, 12 Jun 2025 17:49:55 GMT)
"We show that reinforcement learning with verifiable rewards (RLVR) can elicit strong mathematical reasoning in certain models even with spurious rewards that have little, no, or even negative correlation with the correct answer. For example, RLVR improves MATH-500 performance for Qwen2.5-Math-7B in absolute points by 21.4% (random reward), 13.8% (format reward), 24.1% (incorrect label), 26.0% (1-shot RL), and 27.1% (majority voting)—nearly matching the 29.1% gained with ground truth rewards." The paper reports and examines this counterintuitive result.
The authors write: "Our findings have three main implications: base model pretraining significantly affects RLVR outcomes; even corrupted or spurious supervision can enhance reasoning when it triggers useful existing behaviors; and effects observed in one model family may not generalize to others. Our work highlights the importance of (1) testing across multiple models with differing pretraining distributions, and (2) testing across multiple different baselines, such as format and random rewards, when evaluating reinforcement learning techniques." It is really interesting that the effect depends on the model and that there is some benefit even when the supervision is wrong. Perhaps this means there is still a gap between a model's internal knowledge and the techniques for eliciting it.
The repository is GitHub – ruixin31/Spurious_Rewards, and there is also a blog post at https://rethink-rlvr.notion.site/Spurious-Rewards-Rethinking-Training-Signals-in-RLVR-1f4df34dac1880948858f95aeb88872f?pvs=4.
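To make the reward variants named in the quote more concrete, here is a rough Python sketch of what random, format, incorrect-label, and majority-voting rewards could look like next to a ground-truth reward. This is only an illustration and not the authors' code: the `\boxed{...}` answer convention, the 0.5 probability used for the random reward, and all function names are my own assumptions.

```python
import random
import re
from collections import Counter

# Assumption: final answers are written as \boxed{...}; this convention is
# illustrative and is not taken from the paper's code.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def extract_answer(response):
    """Return the last boxed answer in a response, or None if there is none."""
    matches = BOXED.findall(response)
    return matches[-1].strip() if matches else None


def ground_truth_reward(response, gold):
    """1.0 when the extracted answer equals the gold answer."""
    return 1.0 if extract_answer(response) == gold else 0.0


def random_reward(response, rng, p=0.5):
    """Ignores the response entirely; p=0.5 is an assumed probability."""
    return 1.0 if rng.random() < p else 0.0


def format_reward(response):
    """1.0 whenever a boxed answer is present, regardless of correctness."""
    return 1.0 if extract_answer(response) is not None else 0.0


def incorrect_label_reward(response, wrong_label):
    """Rewards agreement with a label that is known to be wrong."""
    return 1.0 if extract_answer(response) == wrong_label else 0.0


def majority_vote_label(rollouts):
    """Most common extracted answer among sampled rollouts (pseudo-label)."""
    answers = [a for a in map(extract_answer, rollouts) if a is not None]
    return Counter(answers).most_common(1)[0][0] if answers else None


def majority_vote_reward(response, rollouts):
    """Rewards agreement with the majority pseudo-label, with no gold label."""
    label = majority_vote_label(rollouts)
    return 1.0 if label is not None and extract_answer(response) == label else 0.0
```

Only `ground_truth_reward` looks at a correct label; the other functions are the kind of baselines the authors suggest comparing against when evaluating an RL technique.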

Self-Adapting Language Models

Self-Adapting Language Models [44.5]
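Large language models (LLMs) are powerful but static; they lack mechanisms to adapt their weights in response to new tasks, knowledge, or examples. We introduce Self-Adapting LLMs (SEAL). Experiments on knowledge incorporation and few-shot generalization show that SEAL is a promising step toward language models capable of self-directed adaptation.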
Paper reference translation (metadata) (Thu, 12 Jun 2025 17:48:13 GMT)
"We propose Self-Adapting LLMs (SEAL), a framework that enables language models to improve themselves by generating their own synthetic data and optimization parameters (“self-edits”) in response to new data. The model is trained to produce these self-edits directly through token generation with the data provided in the model’s context. Self-edit generation is learned via reinforcement learning (RL) where the model is rewarded for generating self-edits (SE) that, when applied, improve the model’s performance at the target task." This is an approach to self-adaptation, self-evolution, and self-improvement. Its effectiveness is verified on SQuAD and (a subset of) the ARC-AGI benchmark.
My impression is that self-improvement via synthetic data does seem to work. (I think it is already practical to some degree, but) when thinking about a worldview like AGI, the key point may be whether the time constraints can be resolved. (One can imagine a near future in which we say that AI needs sleep too while it runs this kind of processing.)
The project site is Self-Adapting Language Models
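The reward idea in the quoted passage (a self-edit is rewarded when applying it improves performance on the target task) can be sketched as a small sampling loop. This is a toy sketch under my own assumptions, not the SEAL implementation: `propose_self_edit` and `apply_and_evaluate` are hypothetical callables standing in for the model and for the fine-tune-then-evaluate step, and the binary reward with filtering is a simplification I chose for illustration.

```python
from typing import Callable, List, Tuple


def collect_rewarded_self_edits(
    context: str,
    propose_self_edit: Callable[[str], str],
    apply_and_evaluate: Callable[[str], float],
    baseline_score: float,
    num_samples: int = 4,
) -> List[Tuple[str, float]]:
    """Sample self-edits and keep those that beat the baseline score.

    propose_self_edit: hypothetical stand-in for the model generating a
        self-edit (synthetic data plus optimization settings) from context.
    apply_and_evaluate: hypothetical stand-in for applying the self-edit
        (e.g. a small fine-tune) and scoring the result on the target task.
    """
    kept = []
    for _ in range(num_samples):
        edit = propose_self_edit(context)
        score = apply_and_evaluate(edit)
        # Assumed binary reward: 1 if the edit improved on the baseline.
        reward = 1.0 if score > baseline_score else 0.0
        if reward > 0:
            kept.append((edit, score))
    return kept
```

The kept self-edits would then serve as the positive training signal for the next round of the outer RL loop; how that update is actually done is left out here.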

Self-Adapting Improvement Loops for Robotic Learning [30.8]
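Video generative models trained on expert demonstrations have been used as performant text-conditioned visual planners for solving robotic tasks. In this work, we propose the Self-Adapting Improvement Loop (SAIL), in which an in-domain video model iteratively updates itself on self-produced trajectories. We find that performance continues to improve over repeated iterations on novel tasks that were not seen during the original in-domain video model training.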
Paper reference translation (metadata) (Sat, 07 Jun 2025 04:34:37 GMT)
"we highlight that adaptation with large-scale pretrained text-conditioned video models is critical for facilitating self-improvement, by contributing text-conditioned generalization capabilities and motion priors." This one is an approach that leverages video generation models.
The project site is SAIL

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning

Point-RFT: Improving Multimodal Reasoning with Visually Grounded Reinforcement Finetuning [122.8]
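We introduce Point-RFT, a multimodal reasoning framework designed to leverage visually grounded CoT reasoning for visual document understanding. Our approach consists of two stages: first, we conduct format finetuning using a curated dataset of 71K diverse visual reasoning problems, each annotated with detailed, step-by-step rationales explicitly grounded to the corresponding visual elements. On ChartQA, our approach improves accuracy from 70.88% (the language-finetuned baseline) to 90.04%, surpassing the 83.92% accuracy achieved by reinforcement finetuning that relies solely on text-based CoT.
Paper reference translation (metadata) (Mon, 26 May 2025 08:54:14 GMT)
Post-training for MLLMs; a result that leads toward multimodal LRMs.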

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning

J1: Incentivizing Thinking in LLM-as-a-Judge via Reinforcement Learning [69.1]
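We introduce J1, a reinforcement learning approach for training such models (thinking LLM-as-a-Judge models). Our method converts both verifiable and non-verifiable prompts into judgment tasks with verifiable rewards that incentivize thinking and mitigate judgment bias. We find that the model makes better judgments by outlining evaluation criteria, comparing against self-generated reference answers, and re-evaluating the correctness of model responses.
Paper reference translation (metadata) (Thu, 15 May 2025 14:05:15 GMT)
Proposes a reinforcement learning recipe for building Thinking-LLM-as-a-Judge models.
The authors report: "our approach outperforms all other existing 8B or 70B models when trained at those sizes, including models distilled from DeepSeek-R1. J1 also outperforms o1-mini, and even R1 on some benchmarks, despite training a smaller model."
Earlier work, such as Assessing Judging Bias in Large Reasoning Models: An Empirical Study – Introducing the Latest arXiv Papers, had already pointed out that applying LRMs to LLM-as-a-judge tasks is effective, so this result seems consistent with those findings.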
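As a rough illustration of turning a prompt into a judgment task with a verifiable reward, the sketch below builds a pairwise comparison in both response orders and scores the judge's verdict against the known better response. The data layout, the "A"/"B" verdict format, and the position-consistency check are my own assumptions for illustration, not details confirmed by the summary above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class JudgmentTask:
    prompt: str
    response_a: str
    response_b: str
    correct_verdict: str  # "A" or "B" (assumed verdict format)


def make_judgment_tasks(prompt: str, better: str, worse: str) -> List[JudgmentTask]:
    """Build the same comparison in both orders.

    Presenting both orders means a judge that always prefers one position
    cannot be right on both tasks.
    """
    return [
        JudgmentTask(prompt, better, worse, "A"),
        JudgmentTask(prompt, worse, better, "B"),
    ]


def verdict_reward(task: JudgmentTask, verdict: str) -> float:
    """Verifiable reward: 1.0 when the verdict names the better response."""
    return 1.0 if verdict.strip().upper() == task.correct_verdict else 0.0


def position_consistent_reward(tasks: List[JudgmentTask], verdicts: List[str]) -> float:
    """1.0 only when the judge is correct in every ordering of the pair."""
    correct = all(verdict_reward(t, v) == 1.0 for t, v in zip(tasks, verdicts))
    return 1.0 if correct else 0.0
```

The point of the design is that the reward can be checked mechanically even when the original prompt has no verifiable answer, because what is verified is the preference label rather than the response itself.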

Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning [93.3]
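We develop a series of tool-using language models with a training paradigm similar to that of DeepSeek-R1. Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool invocations. Experiments show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results.
Paper reference translation (metadata) (Fri, 25 Apr 2025 02:55:21 GMT)
"We introduces Nemotron-Research-Tool-N1, a series of tool-using language models trained with a rule-based reinforcement learning." A report confirming the effectiveness of rule-based reinforcement learning.
The repository is GitHub – NVlabs/Tool-N1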
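A binary reward that checks only structural validity and functional correctness of tool calls might look something like the sketch below. The `<think>`/`<tool_call>` output layout, the JSON schema with `name` and `arguments`, and the exact-match comparison are all my assumptions for illustration; the actual format used by Tool-N1 may differ.

```python
import json
import re

# Assumed output layout: reasoning inside <think>...</think>, followed by a
# JSON list of calls inside <tool_call>...</tool_call>. Hypothetical format.
OUTPUT_PATTERN = re.compile(
    r"^\s*<think>.*?</think>\s*<tool_call>(.*?)</tool_call>\s*$", re.DOTALL
)


def parse_tool_calls(output):
    """Return the list of tool calls, or None if the structure is invalid."""
    match = OUTPUT_PATTERN.match(output)
    if match is None:
        return None
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list):
        return None
    for call in calls:
        if not (
            isinstance(call, dict)
            and isinstance(call.get("name"), str)
            and isinstance(call.get("arguments"), dict)
        ):
            return None
    return calls


def canonical(calls):
    """Order-insensitive, key-order-insensitive representation of calls."""
    return sorted(json.dumps(call, sort_keys=True) for call in calls)


def binary_tool_reward(output, expected_calls):
    """1.0 only if the format is valid AND the calls match the reference."""
    calls = parse_tool_calls(output)
    if calls is None:
        return 0.0
    return 1.0 if canonical(calls) == canonical(expected_calls) else 0.0
```

Because the reward is all-or-nothing, there is no partial credit for a well-formed call with one wrong argument, which is what makes it a purely rule-based signal.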

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective

A Survey on Data-Centric AI: Tabular Learning from Reinforcement Learning and Generative AI Perspective [23.3]
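Tabular data is one of the most widely used data formats across domains such as bioinformatics, healthcare, and marketing. This survey examines reinforcement learning (RL) and generative approaches for feature selection and feature generation as fundamental techniques for refining the data space. We summarize existing challenges and discuss future research directions, aiming to provide insights that drive continued innovation in this field.
Paper reference translation (metadata) (Wed, 12 Feb 2025 22:34:50 GMT)
The authors state: "Tabular data-centric AI is evolving with RL-based optimization and generative modeling playing a key role in feature engineering." A survey paper covering RL-based optimization and the use of generative AI for tabular data, whose importance has not diminished even today.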

A survey on imbalanced data has also come out. This, too, has long been an important perspective.

A Comprehensive Survey on Imbalanced Data Learning [45.3]
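Imbalanced data is prevalent across various types of raw data and hinders the performance of machine learning. This survey systematically analyzes various real-world data formats. Existing research across these data formats is organized into four categories: data re-balancing, feature representation, training strategy, and ensemble learning.
Paper reference translation (metadata) (Thu, 13 Feb 2025 04:53:17 GMT)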
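Data re-balancing, the first of the four categories listed in the summary, is easy to illustrate with plain random oversampling: minority-class samples are duplicated at random until every class is as large as the biggest one. This is a generic textbook technique that I am adding as an example; it is not taken from the survey, and the function name is my own.

```python
import random
from collections import defaultdict


def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class samples until all classes have equal size.

    Returns new (samples, labels) lists; the inputs are left untouched.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(samples, labels):
        by_class[y].append(x)
    if not by_class:
        return [], []

    target = max(len(xs) for xs in by_class.values())
    out_samples, out_labels = [], []
    for y, xs in by_class.items():
        # Draw extra copies (with replacement) to reach the target size.
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        for x in xs + extra:
            out_samples.append(x)
            out_labels.append(y)
    return out_samples, out_labels
```

Oversampling should be applied to the training split only, since duplicated samples in a validation or test split would distort the evaluation.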

Teaching Language Models to Critique via Reinforcement Learning

Teaching Language Models to Critique via Reinforcement Learning [59.4]
We show that critics trained with CTRL significantly enhance pass rates and mitigate errors across both base and stronger generator models. We also show that these critic models act as accurate generative reward models and enable test-time scaling through iterative critique-revision.
Paper reference translation (metadata) (Wed, 05 Feb 2025 02:18:46 GMT)
"two-stage training approach: (1) synthesizing high-quality critiques by reasoning about execution feedback, then (2) refining the critic through reinforcement learning." A two-stage setup that builds a critic model using reinforcement learning (GRPO).
The project site is CTRL: Critic Training via Reinforcement Learning
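The iterative critique-revision loop mentioned in the summary can be sketched as below. This is a minimal illustration under my own assumptions, not the CTRL implementation: `generate`, `critique`, and `revise` are hypothetical callables standing in for the generator and critic models, and the accept/reject verdict plays the role of the critic's reward signal.

```python
from typing import Callable, Tuple


def critique_and_revise(
    problem: str,
    generate: Callable[[str], str],
    critique: Callable[[str, str], Tuple[bool, str]],
    revise: Callable[[str, str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Run generate -> critique -> revise until the critic accepts.

    Returns the final solution and the number of revisions performed.
    Spending more rounds is the test-time scaling knob in this sketch.
    """
    solution = generate(problem)
    for round_idx in range(max_rounds):
        accepted, feedback = critique(problem, solution)
        if accepted:
            return solution, round_idx
        # Hand the critic's textual feedback back to the generator.
        solution = revise(problem, solution, feedback)
    return solution, max_rounds
```

Note that the loop returns the last revision even if the critic never accepted it, so a caller that needs a guarantee should re-check the final solution.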

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training

SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training [127.5]
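Supervised fine-tuning (SFT) and reinforcement learning (RL) are widely used post-training techniques for foundation models. This paper studies the differences between SFT and RL in terms of generalization and memorization. We show that RL, especially when trained with an outcome-based reward, generalizes across both rule-based textual and visual variants.
Paper reference translation (metadata) (Tue, 28 Jan 2025 18:59:44 GMT)
This paper feels like exactly the information I want right now. The authors state: "Through extensive experiments on the GeneralPoints and V-IRL tasks, we demonstrated that RL exhibits superior performance in learning generalizable knowledge, while SFT tends to merely memorize the training data, across both the rule and visual variations."
In addition to the above, "SFT is necessary for RL training when the backbone model does not follow instructions." is very interesting. My impression is that other cases also often show that the effective training strategy depends on base capability (which also matches intuition), and this looks like important know-how.
The project site is SFT Memorizes, RL Generalizes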
