Easy2Hard – arXiv最新論文の紹介

Empowering LLMs in Decision Games through Algorithmic Data Synthesis

Empowering LLMs in Decision Games through Algorithmic Data Synthesis [29.1]
意思決定ゲームは、大規模言語モデルの推論能力を評価し、強化するための理想的なサンドボックスとして機能する。データ合成戦略を設計し、2つの古典ゲーム、DoudizhuとGoから広範囲のオフラインデータセットをキュレートする。我々は、このデータをLLMトレーニングに効果的に組み込むための一連の技術を開発し、その結果、Mastermind-Dou と Mastermind-Go という2つの新しいエージェントを生み出した。
論文参考訳（メタデータ） (Tue, 18 Mar 2025 07:30:29 GMT)
一般的に数学やコード生成を対象にLRM化が行われているがこの論文では「Through a suite of our designed techniques in data collection and training, we have developed MasterMind agents, demonstrating commendable performance in both Doudizhu and Go.」とゲームが対象。「Empirical experiments also serve to substantiate the potential of this approach in improving general reasoning capabilities of LLMs.」というのがとても興味深い。人間でいうところの「脳によい〇〇」的なタスクがあるのだろうか。（もっとも性能が落ちるタスクがあることも指摘されているが・・・）
データセットが公開されている。OpenDILabCommunity/MasterMind · Datasets at Hugging Face

Easy2Hard-Bench: Standardized Difficulty Labels for Profiling LLM Performance and Generalization [126.3]
さまざまなドメインにまたがる6つのベンチマークデータセットのコレクションであるEasy2Hard-Benchを紹介します。これらのデータセット内の各問題は、数値的な難易度スコアで注釈付けされる。様々な難易度にまたがる性能と一般化能力を総合的に分析する。
論文参考訳（メタデータ） (Fri, 27 Sep 2024 03:49:56 GMT)
「While generalization over tasks from easy to hard is crucial to profile language models (LLMs), the datasets with fine-grained difficulty annotations for each problem across a broad range of complexity are still blank.」とのことで作られたデータセット。面白い傾向が出ている一方でLLMのベンチマークで主要な対象にされているところは難易度を分けるのにも苦労しそうな印象がある。
リポジトリはfuronghuang-lab/Easy2Hard-Bench · Datasets at Hugging Face

Easy-to-Hard Generalization: Scalable Alignment Beyond Human Supervision [99.0]
現在のAIアライメント手法は、人間が提供する実演や判断に依存している。彼らの能力が人間のレベルを超えたとき、システムを改善するにはどうすればよいのか?
論文参考訳（メタデータ） (Thu, 14 Mar 2024 15:12:38 GMT)
The Unreasonable Effectiveness of Easy Training Data for Hard Tasks – arXiv最新論文の紹介 (devneko.jp)でも取り上げられていた話だが、PRMs(process reward models)やOPRMs(Outcome & Process Reward Model)を用いるとさらに有効とのこと。
AGIやASIという話を聞くにこのような手法の重要性が高まっているように思う（一方で結論にある「This approach presents a promising direction for developing AI systems capable of surpassing human problem-solving capabilities」のように人間がEasy側に位置づけられるのは複雑な思いもある）
リポジトリはEdward-Sun/easy-to-hard (github.com)

The Unreasonable Effectiveness of Easy Training Data for Hard Tasks [92.1]
現在の言語モデルは、ハードデータで訓練されたモデルと同様に、比較的容易にハードデータから一般化されることが多い。ハードデータ上でモデルパフォーマンスを最も気にしている場合でも、ハードデータよりも簡単なデータを収集してトレーニングする方がよいことを示す。
論文参考訳（メタデータ） (Fri, 12 Jan 2024 18:36:29 GMT)
易しい問題でチューニングしたモデルが難しい問題に対してもかなり有効であるとの報告。とっても面白い性質。
「Our findings suggest that the scalable oversight problem may be easier than previously thought.」とあるものの意図せず、強力なものを作ってしまう危険性もあるような。。（参考：Fugu-MT 論文翻訳(概要): Measuring Progress on Scalable Oversight for Large Language Models (fugumt.com)）
リポジトリはallenai/easy-to-hard-generalization: Code for the arXiv preprint “The Unreasonable Effectiveness of Easy Training Data” (github.com)