Distillation – Introducing the Latest arXiv Papers

Distilling On-device Language Models for Robot Planning with Minimal Human Intervention

Distilling On-device Language Models for Robot Planning with Minimal Human Intervention [117.9]
PRISM is a framework for distilling small language model (SLM)-enabled robot planners. PRISM is applied to three LLM-enabled planners covering mapping, exploration, manipulation, and household assistance. PRISM is shown to improve the performance of Llama-3.2-3B from 10-20% of GPT-4o's performance to over 93%.
Reference translation of the paper (metadata) (Fri, 20 Jun 2025 21:44:27 GMT)
A proposed distillation framework targeting robot planning: "Given a source LLM-enabled planner, PRISM synthesizes tasks and environments, elicits plans from the LLM-enabled planner in these synthesized environments, and then uses the resulting data to train an SLM-enabled planner that serves as a drop-in replacement for the source model." Intuitively it seems effective, and the reported results are indeed promising.
The project site is PRISM

DD-Ranking: Rethinking the Evaluation of Dataset Distillation

DD-Ranking: Rethinking the Evaluation of Dataset Distillation [223.3]
This paper proposes DD-Ranking, a unified evaluation framework, together with new general evaluation metrics that uncover the true performance improvements achieved by different methods. By refocusing on the actual information enhancement of distilled datasets, DD-Ranking provides a more comprehensive and fair evaluation standard for future research.
Reference translation of the paper (metadata) (Mon, 19 May 2025 16:19:50 GMT)
A proposed benchmark for dataset distillation. The authors state: "It aims to provide a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data. Under the finding that the test accuracy no longer fits the need for fair and comprehensive evaluation, we design new metrics for both the label representation and data augmentation." The observation that partly motivates the work is also interesting: "DD-Ranking demonstrate that previous performance improvements commonly originate from the enhanced model training techniques instead of the distilled dataset."
The repository is GitHub – NUS-HPC-AI-Lab/DD-Ranking: Data distillation benchmark

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions

Knowledge Distillation and Dataset Distillation of Large Language Models: Emerging Trends, Challenges, and Future Directions [35.8]
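The exponential growth of Large Language Models (LLMs) continues to highlight the need for efficient strategies to meet ever-expanding computational and data demands. This survey provides a comprehensive analysis of two complementary paradigms: knowledge distillation (KD) and dataset distillation (DD).
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 23:50:23 GMT)
A survey on distillation.
Surveys that treat knowledge distillation and dataset distillation as a pair, as in "Crucially, the success of KD in LLMs hinges on DD techniques, which enable the creation of compact, information-rich synthetic datasets that encapsulate the diverse and complex knowledge of the teacher LLMs.", may be uncommon.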
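As a reminder of what KD typically optimizes, the following is a minimal sketch of the classic soft-label distillation loss: the KL divergence between temperature-softened teacher and student distributions, scaled by the squared temperature. It is not taken from the survey; the function names `softmax` and `kd_loss` and the default temperature are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; treat it as an assumption of this sketch.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


if __name__ == "__main__":
    teacher = [2.0, 0.5, -1.0]  # hypothetical logits
    student = [0.1, 0.2, 0.3]   # hypothetical logits
    print(kd_loss(teacher, student))
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge. In practice it is usually combined with a standard cross-entropy term on ground-truth labels.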

Antidistillation Sampling

Antidistillation Sampling [98.9]
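Models that generate extended reasoning traces inadvertently produce rich token sequences that facilitate model distillation. Recognizing this vulnerability, model owners may seek sampling strategies that limit the effectiveness of distillation without compromising model performance. Antidistillation sampling poisons the reasoning traces, making them significantly less effective for distillation while preserving the model's practical utility.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 17:54:14 GMT)
As the title says, this proposes a sampling strategy that makes distillation harder.
The project site is Antidistillation Sampling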

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models

TAID: Temporally Adaptive Interpolated Distillation for Efficient Knowledge Transfer in Language Models [6.8]
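This paper introduces Temporally Adaptive Interpolated Distillation (TAID), a novel knowledge distillation method. TAID shows superior performance in instruction tuning and pre-training scenarios across a range of model sizes and architectures. These results indicate that TAID is effective for creating high-performing, efficient models, advancing the development of more accessible AI technologies.
Reference translation of the paper (metadata) (Wed, 29 Jan 2025 05:51:25 GMT)
A proposed distillation method in which "TAID reduces the gap between teacher and student model throughout the training process by dynamically introducing an intermediate teacher that interpolates teacher and student models to provide a target distribution with a modest capability".
The news release is "Release of TinySwallow-1.5B, a small Japanese language model built with the new method TAID", and the repository is TinySwallow – a SakanaAI Collection
Now that LRMs/LLMs whose licenses permit distillation, such as DeepSeek R1, have appeared, methods of this kind seem to be growing in importance.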
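To make the idea of an "intermediate teacher" concrete, here is a minimal sketch of a target distribution that interpolates between the student and the teacher and shifts toward the teacher as training progresses. Mixing in probability space and the linear schedule `linear_alpha` are assumptions made for illustration only; the paper's actual interpolation and its adaptive schedule may differ, and all names here are hypothetical.

```python
import math


def interpolated_target(student_probs, teacher_probs, alpha):
    """Intermediate teacher: (1 - alpha) * student + alpha * teacher.

    alpha = 0 reproduces the student, alpha = 1 reproduces the teacher.
    Probability-space mixing is an assumption of this sketch.
    """
    return [(1 - alpha) * s + alpha * t
            for s, t in zip(student_probs, teacher_probs)]


def linear_alpha(step, total_steps, alpha_start=0.2):
    """Hypothetical schedule: move alpha linearly from alpha_start to 1.0."""
    frac = step / total_steps
    return alpha_start + (1.0 - alpha_start) * frac


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


if __name__ == "__main__":
    student = [0.4, 0.3, 0.3]    # hypothetical student distribution
    teacher = [0.8, 0.15, 0.05]  # hypothetical teacher distribution
    for step in (0, 50, 100):
        alpha = linear_alpha(step, total_steps=100)
        target = interpolated_target(student, teacher, alpha)
        print(step, round(alpha, 2), kl_divergence(target, student))
```

Early in training the target stays close to the student, so the gap the student has to close is small; the gap widens toward the full teacher distribution as alpha approaches 1.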

MINITRON / Compact Language Models via Pruning and Knowledge Distillation

Compact Language Models via Pruning and Knowledge Distillation [61.6]
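The Minitron models show up to a 16% improvement in MMLU score compared with training from scratch. Deriving the 8B and 4B models from an already pretrained 15B model requires up to 40x fewer training tokens per model than training from scratch.
Reference translation of the paper (metadata) (Fri, 19 Jul 2024 21:47:57 GMT)
High-performing 8B and 4B models derived from Nemotron 15B. The paper reports best practices for combining pruning and distillation. Distillation was also used in the work covered in Gemma2, CriticGPT – Introducing the Latest arXiv Papers (devneko.jp); I wonder whether deriving small, high-performing models from large ones will become the standard procedure.
The repository is GitHub – NVlabs/Minitron: A family of compressed models obtained via pruning and knowledge distillation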

SOCRATIC COT

Distilling Reasoning Capabilities into Smaller Language Models [83.7]
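Step-by-step reasoning approaches such as chain of thought (CoT) have proven highly effective at eliciting reasoning capabilities in large language models. However, the success of CoT is fundamentally tied to model size, and models with billions of parameters are often needed for CoT to work. This work proposes a knowledge distillation approach that leverages the step-by-step CoT reasoning capabilities of larger models and distills these capabilities into smaller models.
Reference translation of the paper (metadata) (Thu, 18 May 2023 04:44:51 GMT)
An effort to transfer CoT outputs obtained from a large model to a small model. It appears to decompose CoT into finer-grained QA pairs and train a Question Generator model and a QA model. Reportedly, a small model (GPT-2 large) outperformed a 10x larger model (GPT-3 6B).
The repository is GitHub – kumar-shridhar/Distiiling-LM: The code for the paper : Distilling Reasoning Capabilities into Smaller Language Models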

What do LLMs Know about Financial Markets? A Case Study on Reddit Market Sentiment Analysis

What do LLMs Know about Financial Markets? A Case Study on Reddit Market Sentiment Analysis [15.2]
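Market sentiment analysis of social media content requires knowledge of both financial markets and social media jargon. The proposed pipeline uses a large language model (LLM) to generate weak financial sentiment labels for Reddit posts. With only a handful of prompts, the final model performs on par with existing supervised models.
Reference translation of the paper (metadata) (Wed, 21 Dec 2022 19:11:19 GMT)
A report on obtaining knowledge from a large language model to train a small model, which achieves better performance than the baselines. The financial domain is an interesting choice as well. (Not the main point of the paper, but my impression is that PaLM + CoT is remarkably capable.)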

Dataset distillation for learning efficiently from small data

Dataset Distillation by Matching Training Trajectories [75.9]
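This work proposes a new formulation that optimizes the distilled data to guide networks to a state similar to that of networks trained on real data. Given a network, it is trained for several iterations on the distilled data, and the distilled data is optimized with respect to the distance between the synthetically trained parameters and the parameters trained on real data. The method outperforms existing methods and can distill higher-resolution visual data.
Reference translation of the paper (metadata) (Tue, 22 Mar 2022 17:58:59 GMT)
- Research on creating, from many images, synthetic data that can be learned from efficiently.
  - The images may be efficient to learn from in deep learning terms, but they do look somewhat eerie...
- The repository is Dataset Distillation by Matching Training Trajectories (georgecazenavette.github.io); the datasets are also provided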
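The objective described above, a distance between parameters reached by training on distilled data and parameters reached by training on real data, can be sketched as follows. The function name and the normalization by how far the real-data ("expert") parameters moved are assumptions of this sketch; parameters are represented as flat lists of floats for simplicity.

```python
def trajectory_matching_loss(student_params, expert_start, expert_target):
    """Normalized squared distance between parameter vectors.

    student_params: parameters after a few steps on the distilled data,
                    starting from expert_start.
    expert_start:   parameters at some checkpoint of real-data training.
    expert_target:  parameters at a later checkpoint of real-data training.

    Dividing by the expert's own displacement (an assumption here) makes
    the loss scale-free: 1.0 means the student did not move toward the
    target at all, 0.0 means it landed exactly on it.
    """
    num = sum((s - e) ** 2 for s, e in zip(student_params, expert_target))
    den = sum((a - b) ** 2 for a, b in zip(expert_start, expert_target))
    return num / den


if __name__ == "__main__":
    start = [0.0, 0.0]    # hypothetical expert checkpoint
    target = [1.0, 2.0]   # hypothetical later expert checkpoint
    halfway = [0.5, 1.0]  # hypothetical student after training on distilled data
    print(trajectory_matching_loss(halfway, start, target))
```

In an actual implementation the distilled data would be updated by backpropagating this loss through the student's training steps; that unrolled optimization is omitted here.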

ERNIE 3.0 Titan: the largest Chinese dense pre-trained model

ERNIE 3.0 Titan: Exploring Larger-scale Knowledge Enhanced Pre-training for Language Understanding and Generation [50.0]
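GPT-3 showed that scaling up pre-trained language models can further exploit their potential. To explore the effect of scaling up ERNIE 3.0, the authors trained ERNIE 3.0 Titan, with up to 260 billion parameters, on the PaddlePaddle platform; it outperformed state-of-the-art models on a variety of NLP tasks.
Reference translation of the paper (metadata) (Thu, 23 Dec 2021 17:35:48 GMT)
- Baidu's giant language model, reportedly SoTA on 68 NLP datasets.
- The infrastructure is also fascinating: training uses a heterogeneous setup combining GPUs and Ascend 910, and inference is not possible on an Nvidia A100-SXM4 (40GB), so it is run in a distributed fashion.
- It is also interesting that the paper proposes an Online Distillation Framework that can train multiple students at once.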
