LRM – Page 4 – Introducing the Latest arXiv Papers

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.6]
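Large Reasoning Models (LRMs) have shown substantial performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference. A growing concern is their tendency to produce excessively long reasoning traces. This inefficiency poses significant challenges for training, inference, and real-world deployment.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 15:36:30 GMT)
A survey described as: "In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm." As I also felt with Fugu-MT paper translation (abstract): Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, the cycle of new method → new challenge → comprehensive survey is moving extremely fast.
The repository is GitHub – XiaoYee/Awesome_Efficient_LRM_Reasoning: A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond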

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.8]
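Recent advances in Large Reasoning Models (LRMs) have shown remarkable performance on specialized reasoning tasks. The paper shows that acquiring deliberative reasoning capabilities significantly degrades the foundational capabilities of LRMs. It also shows that adaptive reasoning (Zero-Thinking, Less-Thinking, Summary-Thinking) can effectively mitigate these drawbacks.
Reference translation of the paper (metadata) (Sun, 23 Mar 2025 08:18:51 GMT)
The results in Table 5 ("The overall results of different LRMs under the Zero-Thinking, Summary-Thinking and Summary-Thinking-Plus mode for the evaluation of foundational capabilities.") are very interesting. Simply spending more compute on reasoning is not necessarily better, and the table makes the importance of adaptive strategies clear.
The repository is GitHub – SCIR-SC-Qiaoban-Team/FreeEvalLM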

Cosmos World Foundation Model Platform for Physical AI

Cosmos World Foundation Model Platform for Physical AI [136.1]
We present the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. Our platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training pre-trained world foundation models, and video tokenizers.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 16:59:07 GMT)
A proposal of Cosmos-Reason1, multimodal models for understanding and reasoning about the physical world. "In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes." "With Physical AI SFT and RL, Cosmos-Reason1 can learn intuitive physics, such as the arrow of time and object permanence, which existing models struggle with." The setup resembles CoT-based LRMs. Reasoning models do look effective for this field.
The repository is GitHub – nvidia-cosmos/cosmos-reason1: Cosmos-Reason1 models understand the physical common sense and generate appropriate embodied decisions in natural language through long chain-of-thought reasoning processes.

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control [98.2]
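We introduce Cosmos-Transfer, a conditional world generation model that generates world simulations based on multiple spatial control inputs. We conduct evaluations to analyze the proposed model and demonstrate its applications to Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:57:54 GMT)
This one is also noteworthy: a "diffusion-based conditional world model for multimodal controllable world generation".
The repository is GitHub – nvidia-cosmos/cosmos-transfer1: Cosmos-Transfer1 is a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environments.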

Mistral Small 3.1, Hunyuan-T1

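This is starting to feel like a weekly LLM/LRM roundup, but last week again had plenty of news. Mistral Small 3.1 | Mistral AI is an open model that claims performance competitive with Gemma 3 and others. NVIDIA's llama-3.3-nemotron-super-49b-v1 Model by NVIDIA | NVIDIA NIM showed interesting results on efficiency.

As pre-announced, Tencent released Hunyuan-T1, a Mamba-hybrid LRM (腾讯混元, Hunyuan T1 – a Hugging Face Space by tencent, llm.hunyuan.T1). Its performance looks fully competitive with DeepSeek R1 and o1.

Anthropic announced web search integration (Claude can now search the web \ Anthropic), and OpenAI announced new audio models (Introducing next-generation audio models in the API | OpenAI, OpenAI.fm). From a business standpoint, it looks increasingly important not only to offer LLMs and LRMs but also to fill in the surrounding areas.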

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [11.3]
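The Long Chain-of-Thought (Long CoT) characteristic enhances reasoning ability and enables solving complex problems. The survey first distinguishes Long CoT from Short CoT and introduces a novel taxonomy to categorize current reasoning paradigms. It then discusses phenomena tied to these characteristics, such as the emergence of Long CoT, overthinking, and test-time scaling.
Reference translation of the paper (metadata) (Wed, 12 Mar 2025 17:35:03 GMT)
A survey of Long Chain-of-Thought, which is key to LRMs. As stated in "We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms.", it separates (ordinary) Short CoT from Long CoT.
The repository is Towards Reasoning Era: A Survey of Long Chain-of-Thought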

LLM Post-Training: A Deep Dive into Reasoning Large Language Models

LLM Post-Training: A Deep Dive into Reasoning Large Language Models [131.1]
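Large language models (LLMs) have transformed the natural language processing landscape and enabled diverse applications. Post-training methods allow LLMs to refine their knowledge, improve reasoning, enhance factual accuracy, and align more effectively with user intents and ethical considerations.
Reference translation of the paper (metadata) (Fri, 28 Feb 2025 18:59:54 GMT)
A survey of post-training, which is also drawing attention for LRMs. Fine-tuning, Reinforcement Learning, and Test-time Scaling are the major keywords.
The repository is GitHub – mbzuai-oryx/Awesome-LLM-Post-training: Awesome Reasoning LLM Tutorial/Survey/Guide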

An Empirical Study on Eliciting and Improving R1-like Reasoning Models

An Empirical Study on Eliciting and Improving R1-like Reasoning Models [90.5]
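Scaling RL training has become a central technique for implementing such reasoning models. The authors show that their RL training approach consistently improves the Qwen2.5-32B base model. They also explore the use of tool manipulation and find that it significantly boosts the reasoning performance of large reasoning models.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 15:34:27 GMT)
A technical report on developing R1-like (o1-like) models, an effort that various research institutions are pursuing. The observation that "By effectively utilizing tool manipulation, STILL-3-TOOL-32B achieves an impressive accuracy of 86.67 (greedy search) on AIME 2024. Remarkably, this ability can be activated with only a small number of high-quality training instances" is interesting, and extension to tool use appears to be under way.
The repository is GitHub – RUCAIBox/Slow_Thinking_with_LLMs: A series of technical report on Slow Thinking with LLM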

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning

Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.5]
Recent studies have shown that giving models more time to think through longer chains of thought (CoTs) yields significant improvements on complex reasoning tasks. This paper examines whether scaling with longer CoTs can actually harm the reasoning performance of Large Language Models (LLMs) in certain domains.
Reference translation of the paper (metadata) (Tue, 25 Feb 2025 10:48:05 GMT)
A proposal of the "Thinking-OPtimal Scaling strategy (TOPS) that allows LLMs to decide by themselves how many tokens are needed to solve a given problem.", which aims to provide enough CoT while keeping overly long CoT from having a negative effect.
It has a three-stage structure: "Format Imitation enables the base model to learn how to adopt different levels of reasoning effort e_i to perform System-2 thinking, using a small set of seed data. Reasoning Effort-Conditioned Generation requires the model to apply System-2 thinking to a large set of problems under different reasoning efforts. Self-Improvement select the shortest correct response for each problem among all responses to fine-tune the base model to achieve thinking-optimal test-time scaling."
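The selection rule in the Self-Improvement stage (keep the shortest correct response per problem) can be sketched roughly as follows. This is only an illustration of that one sentence, not the authors' implementation: the `Response` record, its field names, and the assumption that correctness and token counts are already available are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Response:
    # Hypothetical record layout; the paper's actual data format is not given here.
    problem_id: str
    text: str
    num_tokens: int
    is_correct: bool


def select_shortest_correct(responses: Iterable[Response]) -> Dict[str, Response]:
    """Return, for each problem, the correct response with the fewest tokens.

    Problems with no correct response are simply absent from the result.
    """
    best: Dict[str, Response] = {}
    for response in responses:
        if not response.is_correct:
            continue  # incorrect responses are never selected
        current = best.get(response.problem_id)
        if current is None or response.num_tokens < current.num_tokens:
            best[response.problem_id] = response
    return best
```

The selected responses would then serve as fine-tuning data. How ties are broken and how correctness is judged are left open in this sketch.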

Claude 3.7, GPT-4.5, Phi-4, Selene

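Last week also brought a lot of big news, with a run of announcements of models that can be called flagships, such as Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5.

Claude 3.7 is something like a combined LLM and LRM, and it shows strong performance in code generation. Claude 3.7 Sonnet and Claude Code \ Anthropic

GPT-4.5 gives the impression of a huge, high-performance LLM (Introducing GPT-4.5 | OpenAI). It looks very effective in areas that are hard for LRMs to solve. On some individual benchmarks it loses to DeepSeek V3, which is likewise an LLM (AIME 2024 and SWE Verified in GitHub – deepseek-ai/DeepSeek-V3), a result that suggests the end of the era of OpenAI dominance.

Amid all this, new models have also been released in Microsoft's Phi-4 series (Welcome to the new Phi-4 models – Microsoft Phi-4-mini & Phi-4-multimodal). Even the small models appear to deliver sufficient performance.

Moves to complement LLMs and LRMs, such as the powerful evaluator in Frontier AI needs frontier evaluators. Meet Selene., are also interesting.

With so many approaches available (LLMs, LRMs, SLMs, tuning, hybrid configurations) and a growing set of models to choose from, my impression is that we have entered an era in which deciding what to pick is genuinely hard.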

Atla Selene Mini: A General Purpose Evaluation Model [2.9]
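We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance. It is the highest-scoring 8B generative model on RewardBench.
Reference translation of the paper (metadata) (Mon, 27 Jan 2025 15:09:08 GMT)
The paper from the team behind the evaluator mentioned above.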

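Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model.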
phi_4_mm.tech_report.02252025.pdf · microsoft/Phi-4-multimodal-instruct at main

OpenAI GPT-4.5 System Card
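GPT-4.5 scales pre-training further and is designed to be more general-purpose than powerful STEM-focused reasoning models. Its broad knowledge base, stronger alignment with user intent, and improved emotional intelligence make it well suited to tasks such as writing, programming, and solving practical problems.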
OpenAI GPT-4.5 System Card | OpenAI

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging

An Open Recipe: Adapting Language-Specific LLMs to a Reasoning Model in One Day via Model Merging [12.1]
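This paper aims to improve the reasoning capabilities of language-specific large language models (LLMs). DeepSeek R1 excels at reasoning but primarily benefits high-resource languages such as English and Chinese. Low-resource languages remain underserved because of the dominance of English-centric training data and model optimizations.
Reference translation of the paper (metadata) (Thu, 13 Feb 2025 08:10:45 GMT)
Model merging plus SFT to raise the reasoning capability of LLMs. According to the authors: "We demonstrate that, with only publicly available datasets and a computational budget of $120, it is possible to enhance the reasoning capabilities of language-specific LLMs to match the level of DeepSeek R1, without compromising their performance on target language tasks."
rinna Co., Ltd.'s announcement, Release of the "Qwen2.5 Bakeneko 32B" series of Japanese large language models built on Qwen2.5 and DeepSeek R1, appears to take a similar approach.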
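For readers unfamiliar with model merging, the simplest variant is linear interpolation of parameters between two models with the same architecture. The sketch below shows only that generic idea on plain Python lists. It is not the recipe from the paper or from rinna, whose merging details are not described here, and the function name `merge_linear`, the dict-of-lists weight layout, and the `alpha` parameter are assumptions made for illustration.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]


def merge_linear(weights_a: Weights, weights_b: Weights, alpha: float) -> Weights:
    """Blend two parameter sets as alpha * A + (1 - alpha) * B, tensor by tensor.

    Both models must expose the same parameter names and shapes
    (here: flat lists of floats, as a stand-in for real tensors).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if weights_a.keys() != weights_b.keys():
        raise ValueError("models must have identical parameter names")

    merged: Weights = {}
    for name, values_a in weights_a.items():
        values_b = weights_b[name]
        if len(values_a) != len(values_b):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [
            alpha * a + (1.0 - alpha) * b for a, b in zip(values_a, values_b)
        ]
    return merged
```

With `alpha=1.0` the result equals the first model and with `alpha=0.0` the second. Practical merging methods often vary the ratio per layer or per parameter group, which this sketch does not attempt.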
