Dataset distillation – arXiv最新論文の紹介

DD-Ranking: Rethinking the Evaluation of Dataset Distillation

DD-Ranking: Rethinking the Evaluation of Dataset Distillation [223.3]
本稿では,統合評価フレームワークであるDD-Rankingと,異なる手法によって達成された真の性能改善を明らかにするための新しい総合評価指標を提案する。 DD-Rankingは、蒸留データセットの実際の情報強化に再焦点をあてることで、将来の研究の進展に対してより包括的で公正な評価基準を提供する。
論文参考訳（メタデータ） (Mon, 19 May 2025 16:19:50 GMT)
データセット蒸留に対するベンチマークの提案。「It aims to provide a fair evaluation scheme for DD methods that can decouple the impacts from knowledge distillation and data augmentation to reflect the real informativeness of the distilled data. Under the finding that the test accuracy no longer fits the need for fair and comprehensive evaluation, we design new metrics for both the label representation and data augmentation.」とのこと。モチベーションの一つになっているものだが「DD-Ranking demonstrate that previous performance improvements commonly originate from the enhanced model training techniques instead of the distilled dataset.」という指摘も興味深い。
リポジトリはGitHub – NUS-HPC-AI-Lab/DD-Ranking: Data distillation benchmark

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement

SoTA with Less: MCTS-Guided Sample Selection for Data-Efficient Visual Reasoning Self-Improvement [100.9]
ThinkLite-VLはQwen2.5-VL-7Bインストラクションの平均性能を7%向上させる。私たちのコード、データ、モデルはhttps://github.com/si0wang/ThinkLite-VL.orgで公開されています。
論文参考訳（メタデータ） (Thu, 10 Apr 2025 17:49:05 GMT)
効率のよいVision-Languageモデルの推論強化方法の提案。「Our model achieves SoTA performance using only 11k data, and without any additional knowledge distillation.」と使用データが少ない。カギはデータ品質とのこと「Our key insight highlights the critical importance of selecting genuinely challenging examples for Reinforcement Fine-Tuning (RFT).」
リポジトリはGitHub – si0wang/ThinkLite-VL

Predictive Data Selection: The Data That Predicts Is the Data That Teaches

Predictive Data Selection: The Data That Predicts Is the Data That Teaches [19.0]
予測データ選択(PreSelect)は,高速テキストベースのスコアラのみのトレーニングとデプロイを必要とする軽量で効率的なデータ選択手法である。我々は、PreSelectで選択された30Bトークンでトレーニングされたモデルが300Bトークンでトレーニングされたバニラベースラインのパフォーマンスを上回ることを示した。
論文参考訳（メタデータ） (Tue, 04 Mar 2025 06:15:27 GMT)
「Building on this observation, we hypothesize that data on which model losses are predictive of downstream abilities also contribute effectively to learning.」という仮定の下設計されたデータ選択手法PRESELECTの提案。「PRESELECT demonstrates remarkable performance, with an average absolute improvement of 2.8% over the random selection and 20% gains in Math and Code raw text BPC, which shows a promising trend.」と効果を主張。
リポジトリはGitHub – hkust-nlp/PreSelect

Data Selection via Optimal Control for Language Models

Data Selection via Optimal Control for Language Models [134.7]
本研究は,大規模コーパスから高品質な事前学習データを選択することにより,下流利用におけるLMの能力を向上させることを目的とする。 PMP条件を解くことで最適なデータ選択を近似するフレームワークであるPMPベースのデータ選択(PDS)を導入する。 PDSの利点は、スケーリング法則に従ってテスト損失曲線の外挿によって証明されたように、10Tトークンでトレーニングされた400Bモデルにまで拡張される。
論文参考訳（メタデータ） (Wed, 09 Oct 2024 17:06:57 GMT)
「by treating data selection as the control variables (i.e., whether a data point is included in pre-training), the LM pre-training process as the dynamic system, and the LM’s downstream performance as the objective, we leverage Pontryagin’s Maximum Principle (PMP; 63) to derive the necessary conditions for optimal data selection in theory.」という制御理論を応用したデータセレクション手法の提案。「The overhead of running PDS to select data is only about 1/9 of that of pre-training a 1.7B model.」と実用的に思える。
プロジェクトサイトはAdvancing AI for Humanity (thegenerality.com)、リポジトリはLMOps/data_selection at main · microsoft/LMOps · GitHub

DataComp-LM: In search of the next generation of training sets for language models

DataComp-LM: In search of the next generation of training sets for language models [193.3]
DataComp for Language Models (DCLM)は、制御されたデータセット実験のためのテストベッドであり、言語モデルを改善することを目的としている。我々は、Common Crawlから抽出された240Tトークンの標準化コーパス、OpenLMフレームワークに基づく効果的な事前学習レシピ、53の下流評価スイートを提供する。 DCLMベンチマークの参加者は、412Mから7Bパラメータのモデルスケールでの重複、フィルタリング、データ混合などのデータキュレーション戦略を実験することができる。
論文参考訳（メタデータ） (Mon, 17 Jun 2024 17:42:57 GMT)
言語モデルトレーニング時のデータキュレーションのためのベンチマークDataComp for Language Models (DCLM)の提案。重要なベンチマークで小さめのトラックも用意されている（最小トラックは412Mパラメータ、8.2B学習用トークン（元データ469B）、学習用の計算量は2.0e19FLOPs、H100換算で26時間）が、それにしても参加するにも結構な環境が必要そう。。。
プロジェクトサイトはDataComp

関連するものとして下記論文も参考になる。

Data-Centric AI in the Age of Large Language Models [51.2]
本稿では,大規模言語モデル(LLM)に着目した,AI研究におけるデータ中心の視点を提案する。本研究では,LLMの発達段階(事前学習や微調整など)や推論段階(文脈内学習など)において,データが有効であることを示す。データを中心とした4つのシナリオを特定し、データ中心のベンチマークとデータキュレーション、データ属性、知識伝達、推論コンテキスト化をカバーします。
論文参考訳（メタデータ） (Thu, 20 Jun 2024 16:34:07 GMT)
LLMの時代においてもデータは重要、DataCOMPについては「DataComp is a more suitable starting point due to its scale and the promising initial findings.」と記載。

TIVE: Task-level and Instance-level Value Estimation

Less is More: Data Value Estimation for Visual Instruction Tuning [127.4]
視覚的命令データにおける冗長性を除去する新しいデータ選択手法を提案する。 LLaVA-1.5の実験では、約7.5%のデータしか使用していないアプローチが、フルデータ微調整モデルと同等の性能を達成できることが示されている。
論文参考訳（メタデータ） (Thu, 14 Mar 2024 16:47:25 GMT)
visual instruction datasetには不要・冗長なデータが多く含まれており、その重要性を評価して削減する手法を提案。「using only about 7.5% data can achieve comparable performance as the full-data fine-tuned model across seven benchmarks, even surpassing it on four of the benchmarks.」とのことで、非常に効果的に見える。
「Our code and data will be publicly released.」らしい

Effective pruning of web-scale datasets based on complexity of concept clusters

Effective pruning of web-scale datasets based on complexity of concept clusters [48.1]
本稿では,大規模なマルチモーダルデータセットを抽出し,イメージネット上でCLIPスタイルのモデルを訓練する手法を提案する。高品質なデータのより小さなセットでのトレーニングは、トレーニングコストを大幅に削減し、より高いパフォーマンスをもたらす可能性があることに気付きました。 DataComp Mediumのベンチマークでは,38のタスクに対して,最先端のImageNetゼロショット精度と競合平均ゼロショット精度を実現している。
論文参考訳（メタデータ） (Tue, 9 Jan 2024 14:32:24 GMT)
データセットの効果的なフィルタリング方法の提案。LAION datasetで検証。
deduplication, CLIP-score filtering, Density-Based-Pruningのパイプラインでembeddingを効果的に使うアプローチ

AlpaGasus: Training A Better Alpaca with Fewer Data

AlpaGasus: Training A Better Alpaca with Fewer Data [106.9]
52kのAlpacaデータからフィルタした9kの高品質データのみを微調整したAlpaGasusを紹介する。 AlpaGasus は、複数のテストセットで GPT-4 で評価されたオリジナルの Alpaca を著しく上回っている。また、5.7倍高速な訓練も提供し、7B型の訓練時間を80分(アルパカ用)から14分に短縮した。
論文参考訳（メタデータ） (Mon, 17 Jul 2023 17:59:40 GMT)
LLMを用いてinstruction-finetuning用データを高品質化、品質の高い少数データの利用が有効だったという報告。instruction-finetuningのデータ品質の重要性は他の論文でも指摘されており（ゆえにRLHFが有効という話もあり）参考になる。
プロジェクトサイトはAlpaGasus: Training a Better Alpaca with Fewer Data (lichang-chen.github.io)

DoReMi: Domain Reweighting with Minimax Optimization

DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining [172.3]
ミニマックス最適化(DoReMi)を用いたドメイン再重み付けを提案する。 DoReMiはまず、ドメイン上のグループ分散ロバスト最適化(Group DRO)を使用して小さなプロキシモデルをトレーニングし、ドメイン重みを生成する。次に、これらのドメインウェイトでデータセットを再サンプリングし、より大きなフルサイズのモデルをトレーニングします。
論文参考訳（メタデータ） (Wed, 17 May 2023 17:58:13 GMT)
データセットのドメインに対するウェイトを調整する手法の提案。小さなモデルで試行後に大きなモデルでのドメインウェイトを決めるアプトローチで「DoReMi improves average one-shot downstream accuracy by 6.5% and reaches the baseline accuracy 2.6x faster when pretraining on The Pile.」ととても効果的そう
The Pileを用いた実験でWikipediaのウェイトがベースラインよりも低くなっているにもかかわらず、Wikipedia由来のデータセットでのdown stream性能が上がっているのが面白い。なぜなんだろう・・・？

Dataset Distlillationのサーベイ

最近よく見るデータセット蒸留のサーベイ。基本的には少ないデータで十分な性能のモデル構築ができるようなデータセット作成を目的にしているが、生データを公開しなくてもよくなる場合があり情報保護の観点からも重要な技術になりうる。アプローチも様々で興味深い。

Dataset Distillation: A Comprehensive Review [54.3]
データセット蒸留(DD)は、いくつかの合成サンプルを含むはるかに小さなデータセットを目標としている。本稿では,最近のDDの進歩と応用について概説する。
論文参考訳（メタデータ） (Tue, 17 Jan 2023 17:03:28 GMT)

A Comprehensive Survey to Dataset Distillation [91.4]
限られた計算能力で無制限に成長するデータに対処することは困難になっている。ディープラーニング技術はこの10年で前例のない発展を遂げた。本稿では,多面的なデータセット蒸留の総合的な理解を提供する。
論文参考訳（メタデータ） (Fri, 13 Jan 2023 15:11:38 GMT)

2025年8月
月	火	水	木	金	土	日
				1	2	3
4	5	6	7	8	9	10
11	12	13	14	15	16	17
18	19	20	21	22	23	24
25	26	27	28	29	30	31