o1 – Introduction to the Latest arXiv Papers

DeepSeek-R1 Thoughtology: Let’s <think> about LLM Reasoning

DeepSeek-R1 Thoughtology: Let’s <think> about LLM Reasoning [31.8]
This paper examines DeepSeek-R1's thought length, its handling of long or confusing contexts, cultural and safety concerns, and its controllability. DeepSeek-R1 has a reasoning "sweet spot", beyond which extra inference time can impair model performance. DeepSeek-R1 also shows greater safety vulnerabilities than its non-reasoning counterpart.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 00:36:08 GMT)
An analysis of DeepSeek R1's reasoning. Two findings are interesting: “DeepSeek-R1 exhibits higher safety vulnerabilities compared to its non-reasoning counterpart DeepSeek-V3 (DeepSeek-AI et al., 2025b). We also show that the model’s reasoning capabilities can be used to generate jailbreak attacks that successfully elicit harmful responses from safety-aligned LLMs.” and “When presented with moral or cultural questions, DeepSeek-R1 reasons for significantly longer when prompted in English than when prompted in Chinese. It also provides different responses, displaying different sets of cultural values in each language”.

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond [14.4]
We first focus on training long-COT models from scratch, starting from models that lack long-COT capability. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model, Light-R1-32B, from Qwen2.5-32B-Instruct. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpasses 32B models and DeepSeek-R1-Distill-Llama-70B.
Reference translation of the paper (metadata) (Thu, 13 Mar 2025 15:29:22 GMT)
A model built with two-stage SFT + DPO optimization (+ model merge). In line with “High-Quality Data is All You Need”, the dataset pipeline is also elaborate. Other studies have pointed out something similar, but “Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains.” is interesting.
The repository is GitHub – Qihoo360/Light-R1

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.5]
Recent studies have shown that giving models more time to think through longer chains of thought (CoTs) yields substantial improvements on complex reasoning tasks. We examine whether scaling with longer CoTs can impair the reasoning performance of Large Language Models (LLMs) in certain domains.
Reference translation of the paper (metadata) (Tue, 25 Feb 2025 10:48:05 GMT)
Proposes the “Thinking-OPtimal Scaling strategy (TOPS) that allows LLMs to decide by themselves how many tokens are needed to solve a given problem.”, which aims to provide enough CoT while preventing overly long CoT from hurting performance.
It has a three-stage structure: “Format Imitation enables the base model to learn how to adopt different levels of reasoning effort e_i to perform System-2 thinking, using a small set of seed data. Reasoning Effort-Conditioned Generation requires the model to apply System-2 thinking to a large set of problems under different reasoning efforts. Self-Improvement select the shortest correct response for each problem among all responses to fine-tune the base model to achieve thinking-optimal test-time scaling.”
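The selection rule in the Self-Improvement stage ("select the shortest correct response for each problem") can be sketched in a few lines of Python. This is only an illustration under assumptions, not the authors' implementation: the `Response` record, its field names, and the sample data are hypothetical, and correctness is assumed to have been judged beforehand.

```python
from dataclasses import dataclass


@dataclass
class Response:
    """Hypothetical record for one sampled response (names are assumptions)."""
    problem_id: str
    text: str
    num_tokens: int
    is_correct: bool


def select_shortest_correct(responses):
    """Return {problem_id: shortest correct Response}.

    Problems with no correct response are simply left out.
    """
    best = {}
    for r in responses:
        if not r.is_correct:
            continue  # incorrect responses are never selected
        current = best.get(r.problem_id)
        if current is None or r.num_tokens < current.num_tokens:
            best[r.problem_id] = r
    return best


# Made-up samples generated under different reasoning efforts.
samples = [
    Response("p1", "long but right", 900, True),
    Response("p1", "short and right", 200, True),
    Response("p1", "shortest but wrong", 50, False),
    Response("p2", "only a wrong answer", 120, False),
]
chosen = select_shortest_correct(samples)
```

In this sketch, `chosen` would be the fine-tuning set: one response per solvable problem, with length as the tie-breaker among correct answers.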

DeepSeek R1, Sky-T1, TinyZero, Kimi k1.5

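There was a lot of big news again last week. DeepSeek R1 in particular is a very high-performing Large Reasoning Model, and the fact that it is also an open model was a shock. It is also notable that DeepSeek R1 Zero gains its performance through reinforcement learning. Kimi k1.5 was built on a similar idea and likewise seems to demonstrate the effectiveness of reinforcement learning.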

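It is surprising that the models obtained by strengthening Qwen and Llama with data built in the course of developing DeepSeek R1 also improve substantially. The license permits distillation, so R1 also looks promising as a source model for building synthetic data.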

As o1-like open models, Sky-T1: Train your own O1 preview model within $450 and GitHub – Jiayi-Pan/TinyZero (X user Jiayi Pan: “We reproduced DeepSeek R1-Zero in the CountDown game, and it just works Through RL, the 3B base LM develops self-verification and search abilities all on its own You can experience the Ahah moment yourself for < $30 Code: https://t.co/B2IsN1PrXV Here’s what we learned 🧵 https://t.co/43BVYMmS8X” / X) are also interesting.
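To give a feel for why a game like Countdown suits RL with verifiable rewards, here is a minimal sketch of a rule-based checker. It is a hypothetical illustration, not TinyZero's actual reward code: the function name `countdown_reward`, the binary 0/1 reward, and the rule that every given number must be used exactly once are all assumptions.

```python
import ast
import operator
from collections import Counter

# Only the four basic arithmetic operators are accepted (an assumption).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def _evaluate(node, used):
    """Evaluate a parsed expression, recording every integer literal it uses."""
    if (isinstance(node, ast.Constant) and isinstance(node.value, int)
            and not isinstance(node.value, bool)):
        used.append(node.value)
        return node.value
    if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
        left = _evaluate(node.left, used)
        right = _evaluate(node.right, used)
        return _OPS[type(node.op)](left, right)
    raise ValueError("unsupported expression")


def countdown_reward(expression, numbers, target):
    """Return 1.0 if the expression uses each number once and hits the target."""
    try:
        tree = ast.parse(expression, mode="eval")
        used = []
        value = _evaluate(tree.body, used)
    except (SyntaxError, ValueError, ZeroDivisionError):
        return 0.0  # malformed or disallowed expressions earn nothing
    if Counter(used) != Counter(numbers):
        return 0.0  # must use exactly the given numbers
    return 1.0 if abs(value - target) < 1e-9 else 0.0


reward = countdown_reward("(25 - 5) * 4", [25, 5, 4], 80)
```

Because the check is purely mechanical, no learned reward model is needed; the expression is parsed with `ast` rather than passed to `eval` so that arbitrary code in a model's output is rejected instead of executed.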

Beyond that, OpenAI Operator (Introducing Operator research preview | OpenAI) hints at the beginnings of GUI agents.

Amid the surge of open models, it is interesting that OpenAI appears to be reaching beyond the LLM core into adjacent areas.

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning [147.2]
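We introduce our first-generation reasoning models, DeepSeek-R1-Zero and DeepSeek-R1. DeepSeek-R1-Zero is trained via large-scale reinforcement learning. DeepSeek-R1 incorporates multi-stage training and cold-start data before RL.
Reference translation of the paper (metadata) (Wed, 22 Jan 2025 15:19:35 GMT)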

Kimi k1.5: Scaling Reinforcement Learning with LLMs [84.2]
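We report on the training practice of Kimi k1.5, our latest multimodal language model trained with reinforcement learning. Long-context scaling and improved policy optimization methods are the key ingredients of our approach. The system achieves state-of-the-art reasoning performance across multiple benchmarks and modalities.
Reference translation of the paper (metadata) (Wed, 22 Jan 2025 02:48:14 GMT)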

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models

Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models [33.1]
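Large Language Models (LLMs) have attracted significant research interest in leveraging them for complex reasoning tasks. Recent work shows that encouraging LLMs to "think" with more tokens during test-time inference can significantly boost reasoning accuracy. The introduction of OpenAI's o1 series is an important milestone in this line of research.
Reference translation of the paper (metadata) (Thu, 16 Jan 2025 17:37:58 GMT)
A survey of OpenAI o1-like models, i.e. Large Reasoning Models. As the abstract says, “We begin by introducing the foundational background of LLMs and then explore the key technical components driving the development of large reasoning models, with a focus on automated data construction, learning-to-reason techniques, and test-time scaling.”, so the coverage is comprehensive.
As I also felt with the paper below, progress is really fast.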

O1 Replication Journey — Part 3: Inference-time Scaling for Medical Reasoning [27.8]
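This study explores the potential of inference-time scaling in large language models (LLMs) for medical reasoning tasks. With a modest training set of 500 samples, the model achieves a 6%-11% performance improvement.
Reference translation of the paper (metadata) (Sat, 11 Jan 2025 07:10:23 GMT)
The project site is GitHub – SPIRAL-MED/Ophiuchus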

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs

LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs [103.0]
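We propose a comprehensive framework for advancing step-by-step visual reasoning in large language models. First, we introduce a visual reasoning benchmark designed specifically to evaluate multi-step reasoning tasks. Second, we propose a new metric that assesses visual reasoning quality at the granularity of individual steps. Third, we present LlamaV-o1, a new multimodal visual reasoning model trained with a multi-step curriculum learning approach.
Reference translation of the paper (metadata) (Fri, 10 Jan 2025 18:59:51 GMT)
Proposes Visual Reasoning-Chain (VRCBench), a benchmark for multi-step visual reasoning tasks, and builds a model that strengthens Llama-3.2-11B-Vision-Instruct through curriculum learning. The model is at omkarthawakar/LlamaV-o1 · Hugging Face
It achieves performance close to commercial models.
The project site is LlamaV-o1: Rethinking Step-by-step Visual Reasoning in LLMs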

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM

Virgo: A Preliminary Exploration on Reproducing o1-like MLLM [89.5]
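Slow-thinking reasoning systems have attracted wide attention by scaling thinking time during inference. Interest in adapting this capability to multimodal large language models (MLLMs) is also growing. In this paper, we explore a straightforward approach: fine-tuning a capable MLLM with a small amount of textual long-form thought data. We find that these long-form reasoning processes, expressed in natural language, can be effectively transferred to MLLMs.
Reference translation of the paper (metadata) (Fri, 03 Jan 2025 17:14:16 GMT)
Reports that the o1-like approach of spending more time on reasoning is also effective for MLLMs. That is probably to be expected, but it does feel like a fierce chase.
The repository is GitHub – RUCAIBox/Virgo: Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*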

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [15.4]
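In this paper, we present rStar-Math to show that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1. We realize "deep thinking" through Monte Carlo Tree Search (MCTS), performing test-time search guided by an SLM-based process reward model.
Reference translation of the paper (metadata) (Wed, 08 Jan 2025 14:12:57 GMT)
“In this work, we present rStar-Math, a self-evolved System 2 deep thinking approach that significantly boosts the math reasoning capabilities of small LLMs, achieving state-of-the-art OpenAI o1-level performance.” This is the approach currently in fashion. The phrase "self-evolved" has a futuristic ring to it, and it is interesting that even relatively small models achieve high scores.
The repository is https://github.com/microsoft/rStar. It seems to return a 404 at the moment?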

Search-o1: Agentic Search-Enhanced Large Reasoning Models

Search-o1: Agentic Search-Enhanced Large Reasoning Models [24.2]
Large reasoning models (LRMs) such as OpenAI-o1 have demonstrated impressive stepwise reasoning capabilities through large-scale reinforcement learning. We introduce Search-o1, a framework that enhances LRMs with an agentic retrieval-augmented generation (RAG) mechanism and a Reason-in-Documents module.
Reference translation of the paper (metadata) (Thu, 09 Jan 2025 16:48:17 GMT)
Proposes a RAG + Large Reasoning Model framework. It could be read as simply an agentic approach, but the paper states: “(a) Direct reasoning without retrieval often results in inaccuracies due to missing knowledge. (b) Our agentic retrieval-augmented reasoning approach improves knowledge access but usually returns lengthy, redundant documents, disrupting coherent reasoning. (c) Our Search-o1 integrates concise and accurate retrieved knowledge seamlessly into the reasoning process, enabling precise and coherent problem-solving.” In other words, it argues for the effectiveness of using Reason-in-Documents, a process separate from the LRM, to select and summarize information that fits the flow of reasoning before integrating it into the LRM.
The repository is Search-o1: Agentic Search-Enhanced Large Reasoning Models

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs [76.4]
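o1-like models can emulate human-like extended thinking during inference. This paper is the first comprehensive study of the overthinking problem in these models. We propose strategies that mitigate overthinking and streamline the reasoning process without compromising accuracy.
Reference translation of the paper (metadata) (Mon, 30 Dec 2024 18:55:12 GMT)
An interesting paper focused on overthinking: “This paper presents the first comprehensive study on the prevalent issue of overthinking in these models, where excessive computational resources are allocated for simple problems with minimal benefit.”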
