Introducing the latest arXiv papers

Qwen3, Phi-4 reasoning, MiMo 7B, OLMo2 1B, Mellum 4B

Last week brought a lot of news about open models. Qwen3 stands out as the biggest item (Qwen3: Think Deeper, Act Faster | Qwen). Alongside the MoE models Qwen3-235B-A22B and Qwen3-30B-A3B, the dense models Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B have been released (Qwen3 – a Qwen Collection). The license is Apache 2.0. Microsoft's release of reasoning models for Phi-4 (Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers | Microsoft Community Hub, huggingface) is also worth noting.

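There were also many SLM announcements. MiMo from Xiaomi (GitHub – XiaomiMiMo/MiMo: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining) and OLMo 2 1B from Ai2 (OLMo release notes | Ai2) are interesting. Mellum from JetBrains (Mellum Goes Open Source: A Purpose-Built LLM for Developers, Now on Hugging Face | The JetBrains Blog) is a specialized model, as the announcement itself says: "Mellum doesn’t try to know everything. It’s designed to do one thing really well: code completion. We call it a focal model – built with purposeful depth and not concerned with chasing breadth." At present Mellum's performance is hard to call sufficient, but specializing an SLM to strengthen it and improve cost-effectiveness is a promising direction. The 671B DeepSeek-Prover-V2 is impressive, but I think combinations that make good use of a 7B model, as that work does, will also become important.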

Phi-4-reasoning Technical Report [42.5]
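Phi-4-reasoning is a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. We also developed Phi-4-reasoning-plus. Both models outperform larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance level of the full DeepSeek-R1 model.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 05:05:09 GMT)
The LRMs (large reasoning models) of the Phi-4 series.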

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1]
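Chain-of-Thought (CoT) significantly improves the formal reasoning capabilities of large language models (LLMs). However, improving reasoning in Small Language Models (SLMs) remains difficult because of their limited model capacity. This work proposes a systematic training recipe for SLMs consisting of four stages: (1) large-scale mid-training on diverse distilled long-CoT data, (2) fine-tuning on high-quality long-CoT data, (3) Rollout DPO using a carefully curated preference dataset, and (4) reinforcement learning (RL) with verifiable rewards.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 00:04:35 GMT)
Building a reasoning model on top of an SLM. The authors report the effect as follows: "The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500."
An interesting result: reasoning is effective even for small models.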
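Stage (3) of the recipe relies on DPO. As a rough illustration only, the following is a minimal sketch of the standard DPO loss for a single preference pair. The function name, the arguments, and the `beta` default are assumptions made for this example, and the paper's Rollout DPO may differ in how pairs are built and weighted.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are sequence log-probabilities under the policy and under a
    frozen reference model. beta=0.1 is an illustrative value only.
    """
    # Implicit reward margin: how much more strongly the policy prefers
    # the chosen response than the reference model does.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)) == softplus(-margin), computed stably.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))
```

When the policy equals the reference, the margin is zero and the loss is log 2; it decreases as the policy shifts probability toward the chosen response.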

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition [24.5]
We introduce DeepSeek-Prover-V2. The model achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 of the 658 problems in PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems, including 15 problems selected from recent AIME competitions, to enrich the evaluation.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 16:57:48 GMT)
"We first prompt DeepSeek-V3 to generate a natural-language proof sketch while simultaneously formalizing it into a Lean statement with sorry placeholders for omitted proof details. A 7B prover model then recursively solves the decomposed subgoals. By combining these subgoal proofs, we construct a complete formal proof for the original complex problem. This composed proof is appended to DeepSeek-V3’s original chain-of-thought, creating high-quality cold-start training data for formal mathematical reasoning."
The repository is GitHub – deepseek-ai/DeepSeek-Prover-V2
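To give a feel for what "a Lean statement with sorry placeholders" looks like, here is a toy Lean 4 sketch. The theorem and its decomposition are hypothetical and chosen for illustration; they are not taken from the paper or its benchmarks. Each `sorry` marks a subgoal that, in the pipeline described above, a smaller prover model would be asked to fill in.

```lean
-- Hypothetical toy example: the top-level proof is only a skeleton,
-- and each `sorry` stands for a subgoal left to a prover model.
theorem sum_sq_nonneg (a b : Int) : 0 ≤ a * a + b * b := by
  have h1 : 0 ≤ a * a := by sorry  -- subgoal 1
  have h2 : 0 ≤ b * b := by sorry  -- subgoal 2
  exact Int.add_nonneg h1 h2       -- combine the subgoal proofs
```

Lean accepts this skeleton with a warning that the declaration uses `sorry`, so the overall structure can be checked before any subgoal is actually proved.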

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Reinforcement Learning for Reasoning in Large Language Models with One Training Example [129.1]
We show that reinforcement learning with a single training example (1-shot RLVR) is effective at improving the mathematical reasoning capabilities of large language models (LLMs). Interesting phenomena observed in 1-shot RLVR include cross-domain generalization, an increased frequency of self-reflection, and test performance that keeps improving even after training accuracy has saturated.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 09:24:30 GMT)
An interesting report: "We find that selecting one specific example as the training dataset can achieve similar downstream performance to that of the 1.2k DeepScaleR subset (DSR-sub) containing that example. Specifically, this improves the Qwen2.5-Math-1.5B model from 36.0% to 73.6% on MATH500, and from 17.6% to 35.7% on average across 6 mathematical reasoning benchmarks (Fig. 1, 2)." The conclusion "These findings suggest that the reasoning capability of the model is already buried in the base model, and encouraging exploration on a very small amount of data is capable of generating useful RL training signals for igniting LLM’s reasoning capability." seems right to me. It makes me wonder what is actually inside an LLM and what tuning is really doing.
The repository is GitHub – ypwang61/One-Shot-RLVR: official repository for “Reinforcement Learning for Reasoning in Large Language Models with One Training Example”
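RLVR depends on a reward that can be checked automatically. The following is a minimal sketch of such a verifiable reward, assuming the model writes its final answer inside `\boxed{...}` and that exact string match is an acceptable check. Both assumptions, as well as the function name, are made for this example and are not taken from the paper or its repository.

```python
import re

# Matches \boxed{...} with no nested braces (a simplification).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")


def verifiable_reward(response: str, ground_truth: str) -> float:
    """Return 1.0 if the last boxed answer equals the ground truth, else 0.0."""
    matches = BOXED.findall(response)
    if not matches:
        return 0.0  # no parseable final answer
    predicted = matches[-1].strip()
    return 1.0 if predicted == ground_truth.strip() else 0.0
```

A real checker would normally also handle equivalent forms such as `0.5` and `1/2`, which exact string comparison treats as different.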

ReasonIR: Training Retrievers for Reasoning Tasks

ReasonIR: Training Retrievers for Reasoning Tasks [139.5]
ReasonIR-8B is the first retriever trained specifically for general reasoning tasks. It achieves a new state of the art of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 09:49:28 GMT)
A proposal for a bi-encoder retriever built with synthetic data: "We trained REASONIR-8B by fine-tuning LLAMA3.1-8B (Touvron et al., 2023) on a combination of public datasets and the synthetic data generated by REASONIR-SYNTHESIZER." It is also interesting that a hybrid with BM25 remains effective even with this approach.
The repository is GitHub – facebookresearch/ReasonIR: Official repository for paper “ReasonIR Training Retrievers for Reasoning Tasks”, and the model is at reasonir/ReasonIR-8B · Hugging Face
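For readers unfamiliar with the metric quoted above, the following is a minimal sketch of nDCG@10 using the linear-gain form of DCG. The function names are made up for this example, and some evaluations use the exponential gain `2**rel - 1` instead, so this is not necessarily the exact variant behind the reported numbers, which appear to be on a 0 to 100 scale.

```python
import math
from typing import Optional, Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the top-k results (linear gain)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10,
              all_relevances: Optional[Sequence[float]] = None) -> float:
    """nDCG@k: DCG of the ranking divided by the DCG of an ideal ranking.

    ranked_relevances: relevance labels in the order the retriever returned.
    all_relevances: labels of every judged document, if some relevant
    documents were not retrieved; defaults to the retrieved labels.
    """
    pool = all_relevances if all_relevances is not None else ranked_relevances
    ideal_dcg = dcg_at_k(sorted(pool, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents at all
    return dcg_at_k(ranked_relevances, k) / ideal_dcg
```

A single relevant document ranked first scores 1.0, while the same document ranked third scores 0.5, which shows how strongly the metric rewards placing relevant results early.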

The Leaderboard Illusion

The Leaderboard Illusion [30.2]
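Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. We identify systematic issues that have resulted in a distorted playing field.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 15:48:49 GMT)
Points out problems with Chatbot Arena and proposes improvements.
"We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired." and "At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release." are indeed problems.
Designing and operating a leaderboard is very hard, but I hope for improvements where they are feasible.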
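Why testing many private variants matters can be seen in a small simulation. The sketch below assumes every variant has the same true skill and that each measured score carries independent Gaussian noise. All numbers (skill 1200, noise 20, 2000 trials) are invented for illustration and are not estimates from the paper.

```python
import random
import statistics


def expected_published_score(n_variants: int, true_skill: float = 1200.0,
                             noise_sd: float = 20.0, trials: int = 2000,
                             seed: int = 0) -> float:
    """Average published score when only the best of n_variants is released.

    Every variant has the same true skill; only measurement noise differs.
    """
    rng = random.Random(seed)
    best_scores = []
    for _ in range(trials):
        measured = [rng.gauss(true_skill, noise_sd) for _ in range(n_variants)]
        best_scores.append(max(measured))  # keep only the luckiest variant
    return statistics.mean(best_scores)


if __name__ == "__main__":
    for n in (1, 5, 27):
        print(n, round(expected_published_score(n), 1))
```

Under these assumptions the published score rises with the number of variants tested even though no variant is actually better, which is the kind of selection effect that undisclosed private testing can introduce.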

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs [34.4]
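Memory is the process of encoding, storing, and retrieving information. In the era of large language models (LLMs), memory refers to the ability of an AI system to retain, recall, and use information from past interactions to improve future responses and interactions.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 15:05:04 GMT)
A survey on memory in LLMs, a topic that is difficult to handle.
Various methods have been proposed, but many issues remain to be solved. The Open Problems and Future Directions section is very helpful.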

Learning Adaptive Parallel Reasoning with Language Models

Learning Adaptive Parallel Reasoning with Language Models [70.2]
This paper proposes Adaptive Parallel Reasoning (APR). APR generalizes existing reasoning methods by enabling adaptive multi-threaded inference using spawn() and join() operations. The key innovation is an end-to-end reinforcement learning strategy that optimizes both parent and child inference threads to improve task success rate without requiring predefined reasoning structures.
Reference translation of the paper (metadata) (Mon, 21 Apr 2025 22:29:02 GMT)
"We presented Adaptive Parallel Reasoning, which enables language models to adaptively distribute computation across serial and parallel reasoning paths using a parent-child threading mechanism." A method that feels closer to search than to natural language processing. I do think it is effective.
The repository is GitHub – Parallel-Reasoning/APR: Code for Paper: Learning Adaptive Parallel Reasoning with Language Models
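The parent-child pattern behind spawn() and join() can be illustrated without any language model. The sketch below is a toy analogy under stated assumptions: plain Python threads stand in for reasoning threads, a subset-sum search stands in for the reasoning task, and the helpers `explore`, `spawn`, `join`, and `parent` are hypothetical names written for this example. It is not the APR implementation, in which the model itself learns when to spawn.

```python
from __future__ import annotations

from concurrent.futures import Future, ThreadPoolExecutor


def explore(prefix: list[int], remaining: list[int], target: int) -> list[int] | None:
    """Child thread: serial depth-first search for numbers summing to target.

    Assumes all numbers are positive, so overshooting can be pruned.
    """
    total = sum(prefix)
    if total == target:
        return prefix
    if total > target:
        return None
    for i, n in enumerate(remaining):
        found = explore(prefix + [n], remaining[i + 1:], target)
        if found is not None:
            return found
    return None


def spawn(pool: ThreadPoolExecutor, numbers: list[int], target: int) -> list[Future]:
    """Parent hands one starting choice to each child thread."""
    return [pool.submit(explore, [n], numbers[i + 1:], target)
            for i, n in enumerate(numbers)]


def join(children: list[Future]) -> list[list[int] | None]:
    """Parent waits for every child and collects its result."""
    return [child.result() for child in children]


def parent(numbers: list[int], target: int) -> list[int] | None:
    """Parent thread: spawn children in parallel, join, keep the first success."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        results = join(spawn(pool, numbers, target))
    return next((r for r in results if r is not None), None)
```

Here the split into children is fixed by hand. The point of the paper's reinforcement learning strategy is that the split is decided adaptively rather than predefined.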

On The Landscape of Spoken Language Models: A Comprehensive Survey

On The Landscape of Spoken Language Models: A Comprehensive Survey [144.1]
Spoken language models (SLMs) serve as universal speech processing systems. Work in this area is very diverse, with a range of terminology and evaluation settings.
Reference translation of the paper (metadata) (Fri, 11 Apr 2025 13:40:53 GMT)
A survey of spoken language models (SLMs) that opens with "In the last few years, the field of natural language processing (NLP) has evolved from (1) training many task-specific models from scratch, to (2) combining pre-trained multi-purpose contextual representation models (such as BERT (Devlin et al., 2019)) with a small number of task-specific parameters, to (3) training generative universal, large language models (LLMs (Brown et al., 2020; OpenAI et al., 2024)) that perform arbitrary text tasks given natural language instructions (prompts) and can generalize to unseen domains and tasks (Wei et al., 2022a; Liu et al., 2023), and finally to (4) dialogue / chatbot systems that function as assistants and perform tasks while directly interacting with the user." and "The field of speech processing has been undergoing a similar evolution, although with some lag, and has mainly focussed on stages (1) and (2)."

WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

WALL-E 2.0: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents [55.6]
This work proposes "world alignment", which learns symbolic knowledge of the environment that complements large language models (LLMs). It also proposes WALL-E 2.0, an RL-free, model-based agent built on a model predictive control framework. WALL-E 2.0 significantly outperforms existing methods on open-world challenges in Mars (a Minecraft-like environment) and ALFWorld (an embodied indoor environment).
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 10:58:27 GMT)
A paper that opens with "Can we build accurate world models out of large language models (LLMs)? How can world models benefit LLM agents?" and concludes "We have demonstrated that LLMs can effectively serve as world models for agents when aligned with environment dynamics via neurosymbolic knowledge learning.", with the effect confirmed on existing benchmarks.
The repository is GitHub – elated-sawyer/WALL-E: Official code for the paper: WALL-E: World Alignment by NeuroSymbolic Learning improves World Model-based LLM Agents

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions [35.8]
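The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: Knowledge Distillation (KD) and Dataset Distillation (DD).
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 23:50:23 GMT)
A survey on distillation.
As in "Crucially, the success of KD in LLMs hinges on DD techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs.", it surveys knowledge distillation and dataset distillation as a pair, which may be unusual.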

Multilingual Performance Biases of Large Language Models in Education

Multilingual Performance Biases of Large Language Models in Education [39.1]
Large language models (LLMs) are increasingly being adopted in educational settings. This study examines whether their use in non-English educational settings is warranted.
Reference translation of the paper (metadata) (Thu, 24 Apr 2025 16:32:31 GMT)
The paper makes the sensible point that "However, we note that certain models can do terribly on some tasks and languages, so we recommend first verifying that a particular model works well in a particular language on a specific education-related task before deployment.", but "Particularly, we find that GPT4o and Gemini 2.0 perform consistently well across all languages with a few exceptions." gives the impression that multilingual support has come a long way.
The repository is GitHub – eth-lre/multilingual-educational-llm-bias: Data and code for “Multilingual Performance Biases of Large Language Models in Education”
