SLM – Introducing the Latest arXiv Papers

The Rise of Small Language Models in Healthcare: A Comprehensive Survey

The Rise of Small Language Models in Healthcare: A Comprehensive Survey [8.6]
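Small language models (SLMs) offer scalable and clinically viable solutions for next-generation healthcare informatics. This comprehensive survey presents a taxonomic framework for identifying and categorizing SLMs for healthcare professionals. It also compiles experimental results across widely studied NLP tasks to highlight the transformative potential of SLMs in healthcare.
Reference translation of the paper (metadata) (Wed, 23 Apr 2025 22:02:25 GMT)
A survey of SLMs in healthcare.
The repository is GitHub – drmuskangarg/SLMs-in-healthcare: Unlike vanilla contextual pre-trained fundamentally small language models (e.g., ClinicalBERT), our interest lies in compressed and optimized approaches for language models in healthcare, developed as a resource-efficient and domain-specialized solution to LLMs. It shows that quite a large number of models have been built.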

Qwen3, Phi-4 reasoning, MiMo 7B, OLMo2 1B, Mellum 4B

Last week brought a lot of news about open models. The biggest was Qwen3 (Qwen3: Think Deeper, Act Faster | Qwen). In addition to the MoE models Qwen3-235B-A22B and Qwen3-30B-A3B, the dense models Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B have been released (Qwen3 – a Qwen Collection). The license is Apache 2.0. Microsoft's release of reasoning models for Phi-4 (Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers | Microsoft Community Hub, huggingface) is also worth noting.

There were also many SLM announcements. MiMo from Xiaomi (GitHub – XiaomiMiMo/MiMo: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining) and OLMo from Ai2 (OLMo release notes | Ai2) are interesting. Mellum from JetBrains (Mellum Goes Open Source: A Purpose-Built LLM for Developers, Now on Hugging Face | The JetBrains Blog) is a specialized model, as its announcement says: "Mellum doesn’t try to know everything. It’s designed to do one thing really well: code completion. We call it a focal model – built with purposeful depth and not concerned with chasing breadth." Mellum's performance is hard to call sufficient at present, but specializing and strengthening an SLM to improve cost-effectiveness is a promising direction. The 671B DeepSeek-Prover-V2 is impressive, but I think combinations that make good use of a 7B model will also become important.

Phi-4-reasoning Technical Report [42.5]
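Phi-4-reasoning is a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. We also develop Phi-4-reasoning-plus. Both models outperform larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 05:05:09 GMT)
An LRM in the Phi-4 series.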

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1]
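Chain-of-Thought (CoT) significantly improves the formal reasoning capabilities of large language models (LLMs). However, improving reasoning in Small Language Models (SLMs) remains challenging because of their limited model capacity. This work proposes a systematic training recipe for SLMs consisting of four stages: (1) large-scale mid-training on diverse distilled long-CoT data, (2) fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) reinforcement learning (RL) with verifiable reward.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 00:04:35 GMT)
Building a reasoning model on top of an SLM. The paper reports the effect as follows: "The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500."
An interesting result: reasoning is effective even for small models.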
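Stage (3) of the recipe above relies on DPO. As a rough illustration, here is a minimal sketch of the standard per-pair DPO loss, not the paper's Rollout DPO implementation. The function name, the argument names, and the default `beta` are assumptions made for this example; the inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair (illustrative only).

    loss = -log(sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r))))
    """
    # How much more the policy likes each response than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    x = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(x)) == softplus(-x), written in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


# Policy identical to the reference: loss equals log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Policy prefers the chosen response more than the reference does: lower loss.
print(dpo_loss(-1.0, -3.0, -2.0, -2.0))
```

The loss falls as the policy widens the gap between the chosen and rejected responses relative to the reference model. This is why the quality of the preference dataset matters so much at this stage.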

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition [24.5]
We introduce DeepSeek-Prover-V2. The model achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 of the 658 problems in PutnamBench. In addition to standard benchmarks, we introduce ProverBench, a collection of 325 formalized problems that includes 15 problems selected from recent AIME competitions, to enrich the evaluation.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 16:57:48 GMT)
"We first prompt DeepSeek-V3 to generate a natural-language proof sketch while simultaneously formalizing it into a Lean statement with sorry placeholders for omitted proof details. A 7B prover model then recursively solves the decomposed subgoals. By combining these subgoal proofs, we construct a complete formal proof for the original complex problem. This composed proof is appended to DeepSeek-V3’s original chain-of-thought, creating high-quality cold-start training data for formal mathematical reasoning."
The repository is GitHub – deepseek-ai/DeepSeek-Prover-V2
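To make "a Lean statement with sorry placeholders" concrete, here is a hypothetical toy example in Lean 4. The theorem and its decomposition were invented for illustration and are not taken from the paper or its repository. The overall proof structure is fixed first, and each `have` step is a subgoal whose proof is left as `sorry` for a prover to fill in later.

```lean
-- Hypothetical toy example of a proof sketch with `sorry` placeholders.
theorem sum_sq_nonneg (a b : Int) : 0 ≤ a * a + b * b := by
  -- Subgoal 1: left open for the prover model.
  have h1 : 0 ≤ a * a := by sorry
  -- Subgoal 2: left open for the prover model.
  have h2 : 0 ≤ b * b := by sorry
  -- The sketch already shows how the subgoals combine into the final claim.
  exact Int.add_nonneg h1 h2
```

Once every `sorry` is replaced by an actual proof, the combined result is a complete formal proof of the original statement. This matches the composition step described in the quote above.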

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [26.0]
We introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains a total of 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators. We evaluate a series of state-of-the-art models on Multi-SWE-bench using three representative methods. We also launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets.
Reference translation of the paper (metadata) (Thu, 03 Apr 2025 14:06:17 GMT)
"we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++." In a sense, this is a multilingual benchmark across programming languages. MopenHands, which is essentially a modified version of OpenHands, looks strongest overall, and it is interesting that results differ between languages.
- GitHub – All-Hands-AI/OpenHands: 🙌 OpenHands: Code Less, Make More. It is also interesting that OpenHands is building an LLM tuned for code generation, announced in Introducing OpenHands LM 32B — A Strong, Open Coding Agent Model.
The repository is GitHub – multi-swe-bench/multi-swe-bench: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving, and the leaderboard is Multi-SWE-bench
The mention of AGI in "Multi-SWE-RL is an open-source community aimed at developing high-quality RL training datasets for complex software engineering tasks. Its purpose is to serve as the foundational infrastructure for training fully autonomous agents capable of addressing real-world software engineering challenges, paving the way toward achieving AGI." and the statement "In light of these advancements, we are firmly convinced that “scaling RL in real-world environments is the path toward human-like intelligence”." are exciting.

Self-Taught Self-Correction for Small Language Models

Self-Taught Self-Correction for Small Language Models [16.5]
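This work explores self-correction in small language models (SLMs) through iterative fine-tuning that uses only self-generated data. We introduce the Self-Taught Self-Correction (STaSC) algorithm, which incorporates multiple algorithmic design choices. Experimental results on a question-answering task show that STaSC effectively learns self-correction, leading to significant performance improvements.
Reference translation of the paper (metadata) (Tue, 11 Mar 2025 17:57:44 GMT)
Proposes Self-Taught Self-Correction (STaSC), which builds various forms of self-correction into STaR.
The repository is GitHub – VityaVitalich/STASC: [ICLR 2025 SSI-FM] Self-Taught Self-Correction for Small Language Models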
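The idea of iterative fine-tuning on self-generated corrections can be sketched with a toy loop. This is a simplified illustration under stated assumptions, not the STaSC implementation from the repository. The names `collect_corrections`, `fine_tune`, and `self_correction_loop` are hypothetical, "fine-tuning" is stood in for by memorizing successful corrections, and the filter (keep only corrections that turn a wrong answer into a right one) is one possible design choice among several.

```python
from typing import Callable, Dict, List, Tuple

Answerer = Callable[[str], str]
Corrector = Callable[[str, str], str]
Example = Tuple[str, str, str]  # (question, initial answer, corrected answer)


def collect_corrections(questions: List[str], gold: Dict[str, str],
                        answer: Answerer, correct: Corrector) -> List[Example]:
    """Generate an answer and a correction, keeping only improving corrections."""
    kept = []
    for q in questions:
        first = answer(q)
        second = correct(q, first)
        # Assumed filter: the correction must fix an initially wrong answer.
        if first != gold[q] and second == gold[q]:
            kept.append((q, first, second))
    return kept


def fine_tune(correct: Corrector, data: List[Example]) -> Corrector:
    """Toy stand-in for fine-tuning: memorize the kept corrections."""
    memory = {(q, first): second for q, first, second in data}

    def tuned(q: str, first: str) -> str:
        return memory.get((q, first), correct(q, first))

    return tuned


def self_correction_loop(questions: List[str], gold: Dict[str, str],
                         answer: Answerer, correct: Corrector,
                         rounds: int = 2) -> Corrector:
    """Repeat: sample corrections, filter them, update the corrector."""
    for _ in range(rounds):
        data = collect_corrections(questions, gold, answer, correct)
        correct = fine_tune(correct, data)
    return correct


# Toy usage: the first attempt is wrong on "2+2"; the corrector recomputes the sum.
gold = {"2+2": "4", "3+5": "8"}
first_try = lambda q: "5" if q == "2+2" else "8"
recompute = lambda q, first: str(sum(int(t) for t in q.split("+")))
tuned = self_correction_loop(list(gold), gold, first_try, recompute)
print(tuned("2+2", "5"))
```

In a real setting the answerer and corrector would be the same SLM prompted in two ways, and the update step would be actual gradient-based fine-tuning on the kept examples.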

Claude 3.7, GPT-4.5, Phi-4, Selene

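Last week again brought a lot of big news, with a string of announcements of models that can be called flagships, including Anthropic's Claude 3.7 Sonnet and OpenAI's GPT-4.5.

Claude 3.7 could be described as an LLM and LRM in one, and it performs strongly at code generation. Claude 3.7 Sonnet and Claude Code \ Anthropic

GPT-4.5 gives the impression of a huge, high-performance LLM (Introducing GPT-4.5 | OpenAI). It looks very effective in areas that are hard for LRMs to solve. On some individual benchmarks it loses to DeepSeek V3, which is also an LLM (AIME 2024 and SWE Verified in GitHub – deepseek-ai/DeepSeek-V3), a result that suggests the era of OpenAI dominance is ending.

Amid all this, new models have also been released in Microsoft's Phi-4 series: Welcome to the new Phi-4 models – Microsoft Phi-4-mini & Phi-4-multimodal. Even these small models appear to deliver sufficient performance.

Moves to complement LLMs and LRMs, such as strong evaluators (Frontier AI needs frontier evaluators. Meet Selene.), are also interesting.

With LLMs, LRMs, SLMs, tuning, hybrid configurations, and various other approaches available, the choice of models keeps growing, and my impression is that we have entered an era in which it is hard to decide what to choose.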

Atla Selene Mini: A General Purpose Evaluation Model [2.9]
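We introduce Atla Selene Mini, a state-of-the-art small language model-as-a-judge (SLMJ). Selene Mini is a general-purpose evaluator that outperforms the best SLMJs and GPT-4o-mini on overall performance. It is the highest-scoring 8B generative model on RewardBench.
Reference translation of the paper (metadata) (Mon, 27 Jan 2025 15:09:08 GMT)
A paper from the team behind the evaluator mentioned above.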

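Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs
We introduce Phi-4-Mini and Phi-4-Multimodal, compact yet highly capable language and multimodal models. Phi-4-Mini is a 3.8-billion-parameter language model trained on high-quality web and synthetic data. Phi-4-Multimodal is a multimodal model that integrates text, vision, and speech/audio input modalities into a single model.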
phi_4_mm.tech_report.02252025.pdf · microsoft/Phi-4-multimodal-instruct at main

OpenAI GPT-4.5 System Card
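GPT-4.5 scales pre-training further and is designed to be more general-purpose than OpenAI's powerful STEM-focused reasoning models. Its broad knowledge base, stronger alignment with user intent, and improved emotional intelligence make it well suited to tasks such as writing, programming, and solving practical problems.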
OpenAI GPT-4.5 System Card | OpenAI

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model

SmolLM2: When Smol Goes Big — Data-Centric Training of a Small Language Model [33.9]
SmolLM2 is a state-of-the-art "small" (1.7 billion parameter) language model. We overtrain SmolLM2 on roughly 11 trillion tokens using a multi-stage training process that mixes web text with specialized math, code, and instruction-following data. We show that SmolLM2 outperforms other recent small LMs, including Qwen2.5-1.5B and Llama3.2-1B.
Reference translation of the paper (metadata) (Tue, 04 Feb 2025 21:43:16 GMT)
An SLM from Hugging Face: "SmolLM2 advances the state-of-the-art for open small LMs through a combination of careful dataset curation and multistage training." The paper claims that "SmolLM2 outperforms other recent small LMs including Qwen2.5-1.5B and Llama3.2-1B."
The repository is SmolLM2 – a HuggingFaceTB Collection

Benchmarking Large and Small MLLMs

Benchmarking Large and Small MLLMs [71.8]
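Large multimodal language models (MLLMs) have made remarkable progress in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA-series models and Phi-3-Vision, offer a promising alternative with faster inference, reduced deployment costs, and the ability to handle domain-specific scenarios.
Reference translation of the paper (metadata) (Sat, 04 Jan 2025 07:44:49 GMT)
A comprehensive evaluation of MLLMs.
It reports that "GPT-4o establishes a new standard for multimodal understanding and reasoning across diverse input types, setting a benchmark in versatility and cognitive capacity." and also that "Although LLaVA-NeXT and Phi-3-Vision excel in specialized recognition tasks, they exhibit limitations in advanced reasoning and temporal sequence processing."
As this is a study from Microsoft, I also look forward to an update with Phi-4. microsoft/phi-4 · Hugging Face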

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking

rStar-Math: Small LLMs Can Master Math Reasoning with Self-Evolved Deep Thinking [15.4]
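We present rStar-Math to demonstrate that small language models (SLMs) can rival or even surpass the math reasoning capability of OpenAI o1. rStar-Math exercises "deep thinking" through Monte Carlo Tree Search (MCTS), performing test-time search guided by an SLM-based process reward model.
Reference translation of the paper (metadata) (Wed, 08 Jan 2025 14:12:57 GMT)
"In this work, we present rStar-Math, a self-evolved System 2 deep thinking approach that significantly boosts the math reasoning capabilities of small LLMs, achieving state-of-the-art OpenAI o1-level performance." This is a currently popular approach. The term "self-evolved" feels futuristic, and it is interesting that even relatively small models can achieve high scores.
The repository is https://github.com/microsoft/rStar. It seems to return a 404 at the moment?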
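As a rough picture of how MCTS can use step-level reward scores during test-time search, the sketch below implements the textbook UCT selection rule and a simple backpropagation step. It is a generic illustration, not rStar-Math's actual algorithm. The `Node` class, the exploration constant `c`, and the idea that `reward` comes from a process reward model are assumptions made for this example.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One partial solution; `step` is the latest reasoning step (hypothetical)."""
    step: str
    visits: int = 0
    value_sum: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct(parent: Node, child: Node, c: float = 1.4) -> float:
    """Textbook UCT: average value plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # always try unvisited steps first
    exploit = child.value_sum / child.visits
    explore = c * math.sqrt(math.log(max(parent.visits, 1)) / child.visits)
    return exploit + explore


def select_child(parent: Node, c: float = 1.4) -> Node:
    """Pick the child step with the highest UCT score."""
    return max(parent.children, key=lambda ch: uct(parent, ch, c))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add a reward (e.g. a process reward model score) to every node on the path."""
    for node in path:
        node.visits += 1
        node.value_sum += reward


# Toy usage: two candidate next steps with different accumulated rewards.
root = Node("problem", visits=20)
good = Node("step A", visits=10, value_sum=9.0)
bad = Node("step B", visits=10, value_sum=2.0)
root.children = [good, bad]
print(select_child(root).step)
```

With equal visit counts the exploration bonus is identical for both children, so the step with the higher average reward is selected. An unvisited step would be tried before either of them.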

Phi4, InternVL 2.5, EXAONE 3.5

Gemini 2.0 and OpenAI's 12 days of announcements are getting a lot of attention, but a variety of OSS and openly released models have also been announced.

Phi-4 Technical Report [72.1]
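We present phi-4, a 14-billion-parameter language model developed with a focus on data quality. Unlike most language models, whose pre-training is based primarily on organic data sources such as web content or code, phi-4 strategically incorporates synthetic data throughout the training process.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 03:37:41 GMT)
The latest version of Phi, a small, high-performance model. As "phi-4 strategically incorporates synthetic data throughout the training process." indicates, the approach makes effective use of synthetic data. It is a strong model that surpasses Phi-3 and approaches GPT-4o mini.
There is also an announcement on the official blog: Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning | Microsoft Community Hub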

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases [35.0]
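The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. For commercial use, please refer to the official contact point of LG AI Research.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 09:31:10 GMT)
Openly released models from LG, with performance competitive with Qwen2.5 models of the same size.
The repository is LGAI-EXAONE (LG AI Research)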

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [121.1]
InternVL 2.5 is an advanced multimodal large language model (MLLM) series that builds on InternVL 2.0. InternVL 2.5 is competitive with leading commercial models such as GPT-4o and Claude-3.5-Sonnet. We hope this model contributes to the open-source community by setting new standards for developing and applying multimodal AI systems.
Reference translation of the paper (metadata) (Fri, 06 Dec 2024 18:57:08 GMT)
An OSS MLLM whose performance is reportedly competitive with commercial models. The architecture is described as "we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.", an approach that connects a ViT to an LLM through a projector.
The repositories are OpenGVLab/InternVL2_5-78B · Hugging Face and GitHub – OpenGVLab/InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. 接近GPT-4o表现的开源多模态对话模型
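The "ViT, projector, LLM" wiring can be illustrated with a tiny pure-Python sketch. It is a shape-level illustration under stated assumptions, not InternVL's code. The dimensions, the two-layer design without biases, the GELU activation, and all function names are hypothetical choices for this example. A randomly initialized MLP maps each vision token from the ViT's feature width to the LLM's embedding width, and the projected tokens are then placed alongside the text token embeddings.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def init_linear(in_dim: int, out_dim: int, rng: random.Random) -> Matrix:
    """Randomly initialized weight matrix of shape (in_dim, out_dim)."""
    scale = 1.0 / math.sqrt(in_dim)
    return [[rng.uniform(-scale, scale) for _ in range(out_dim)]
            for _ in range(in_dim)]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply an (n, in_dim) matrix by an (in_dim, out_dim) matrix."""
    cols = list(zip(*w))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in x]


def gelu(v: float) -> float:
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def project(vision_tokens: Matrix, w1: Matrix, w2: Matrix) -> Matrix:
    """Two-layer MLP projector: vision feature space -> LLM embedding space."""
    hidden = [[gelu(v) for v in row] for row in matmul(vision_tokens, w1)]
    return matmul(hidden, w2)


rng = random.Random(0)
vit_dim, llm_dim = 8, 16                      # assumed toy widths
w1 = init_linear(vit_dim, llm_dim, rng)
w2 = init_linear(llm_dim, llm_dim, rng)

vision_tokens = [[rng.random() for _ in range(vit_dim)] for _ in range(4)]  # 4 patches
text_embeddings = [[0.0] * llm_dim for _ in range(3)]                       # 3 text tokens

# The LLM would consume the projected vision tokens followed by the text tokens.
llm_input = project(vision_tokens, w1, w2) + text_embeddings
print(len(llm_input), len(llm_input[0]))
```

Because the projector is the only new component, the pre-trained vision encoder and LLM can be swapped independently. This matches the quote's mention of pairing InternViT with different LLMs.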

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.9]
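This work introduces disentangled streaming perception, reasoning, and memory mechanisms to enable real-time interaction with streaming video and audio input. The project simulates human-like cognition, enabling multimodal large language models to provide continuous and adaptive service over time.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:58:30 GMT)
A framework that offers not only real-time streaming but also memory and other capabilities.
The repository is GitHub – InternLM/InternLM-XComposer: InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions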

Owl-1: Omni World Model for Consistent Long Video Generation [75.5]
We propose the Omni World ModeL (Owl-1). Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:59:01 GMT)
A video generation model. The repository is GitHub – huang-yh/Owl

A Survey of Small Language Models

A Survey of Small Language Models [104.8]
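Small language models (SLMs) are becoming increasingly important because they perform language tasks efficiently and effectively with minimal computational resources. This paper presents a comprehensive survey of SLMs, focusing on their architectures, training techniques, and model compression techniques.
Reference translation of the paper (metadata) (Fri, 25 Oct 2024 23:52:28 GMT)
A survey of small language models (although in practice they feel more like small-scale LLMs).
As the paper says, "The inherent difficulty of a survey of small language models is that the definitions of “small” and “large” are a function of both context and time. GPT2, a “large language model” in 2019 at 1.5B parameters, is smaller than many “small” language models covered in this survey.", so the big question is what "small" actually means.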
