Scaling Law – Introductions to the Latest arXiv Papers

Language Models Improve When Pretraining Data Matches Target Tasks

Language Models Improve When Pretraining Data Matches Target Tasks [8.9]
BETR is a method that selects pretraining documents based on their similarity to benchmark training examples. Data selection methods are compared by training more than 500 models spanning 10^19 to 10^22 FLOPs and fitting scaling laws to them. BETR achieves a 2.1x compute multiplier over DCLM-Baseline and improves performance on 9 of 10 tasks at every scale.
Reference translation of the paper (metadata) (Wed, 16 Jul 2025 17:59:45 GMT)
"We tested whether language models improve when pretraining data matches target tasks. This hypothesis seems almost self-evident: training on relevant data should naturally improve relevant capabilities." is a "well, of course" point, but it becomes interesting once the analysis goes as far as "Although explicit targeting might seem at odds with pretraining’s traditional emphasis on generality, our scaling analysis offers a reconciling insight: as compute increases, optimal filtering becomes predictably less strict. Smaller models perform best when trained on narrowly filtered datasets, while larger models benefit from more diverse data."
As the paper itself notes, I am very curious how this plays out in the multilingual setting.
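
The "2.1x compute multiplier" above can be read off fitted scaling laws. The following is a minimal sketch, assuming each data-selection method's loss follows a pure power law in compute, L(C) = a * C^(-b). The function name and every coefficient below are hypothetical illustrations, not values or code from the paper.

```python
def compute_multiplier(a_base, b_base, a_new, b_new, compute):
    """How much more compute the baseline needs to match the new method.

    Assumes (hypothetically) that both methods follow loss = a * C**(-b).
    """
    # Loss the new method reaches at the given compute budget.
    target_loss = a_new * compute ** (-b_new)
    # Invert the baseline law: a_base * C**(-b_base) == target_loss.
    matching_compute = (a_base / target_loss) ** (1.0 / b_base)
    return matching_compute / compute


# Hypothetical coefficients, chosen so the new method behaves
# like the baseline run with 2.1x the compute.
B = 0.05
A_BASE = 30.0
A_NEW = A_BASE * 2.1 ** (-B)

multiplier = compute_multiplier(A_BASE, B, A_NEW, B, compute=1e20)
print(f"compute multiplier: {multiplier:.2f}")
```

When both methods share the same exponent, the multiplier is independent of the compute budget; with different exponents it varies with scale, which is why it has to be evaluated at a specific budget.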

WorldPM: Scaling Human Preference Modeling

WorldPM: Scaling Human Preference Modeling [130.2]
We propose World Preference Modeling (WorldPM) to highlight this scaling potential. Preference data is collected from public forums covering diverse user communities. Extensive training is conducted on 15M-scale data with models ranging from 1.5B to 72B parameters.
Reference translation of the paper (metadata) (Thu, 15 May 2025 17:38:37 GMT)
The authors state: "Motivated by scaling laws in language modeling that demonstrate how test loss scales as a power law with model and dataset sizes, we find that similar laws exist in preference modeling." They further claim: "Through evaluations on 7 benchmarks with 20 subtasks, we find that WorldPM broadly improves the generalization performance across human preference datasets of varying sizes (7K, 100K and 800K samples), with performance gains exceeding 5% on many key subtasks." The potential of this kind of foundation model is intriguing (though also slightly frightening).
- The filtering result in the Appendix is also interesting: "we argue that applying RM filtering diverges from capturing world preference. Instead of assuming forum data contains noise, we should interpret apparent contradictions as manifestations of genuine human preferences, allowing models to discover underlying commonalities within these surface-level conflicts."
The repository is GitHub – QwenLM/WorldPM
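
To make "test loss scales as a power law" concrete, here is a minimal sketch of how such a law can be fitted: a power law y = a * x^(-b) is a straight line in log-log space, so ordinary least squares on the logarithms recovers a and b. The function name and the synthetic data points are assumptions for illustration and are not taken from the WorldPM paper or repository.

```python
import math


def fit_power_law(xs, ys):
    """Least-squares fit of y = a * x**(-b) in log-log space; returns (a, b)."""
    log_x = [math.log(x) for x in xs]
    log_y = [math.log(y) for y in ys]
    n = len(log_x)
    mean_x = sum(log_x) / n
    mean_y = sum(log_y) / n
    sxx = sum((u - mean_x) ** 2 for u in log_x)
    sxy = sum((u - mean_x) * (v - mean_y) for u, v in zip(log_x, log_y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Hypothetical (model size, test loss) pairs generated from a known law.
sizes = [1.5e9, 7e9, 32e9, 72e9]
losses = [5.0 * n ** (-0.1) for n in sizes]

a_fit, b_fit = fit_power_law(sizes, losses)
print(f"a = {a_fit:.3f}, b = {b_fit:.3f}")
```

Real measurements are noisy and often include an irreducible-loss offset, in which case a nonlinear fit is needed instead of this log-log regression.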

Scaling Laws of Synthetic Data for Language Models

Scaling Laws of Synthetic Data for Language Models [132.7]
We introduce SynthLLM, a scalable framework that converts pretraining corpora into diverse, high-quality synthetic datasets. The approach uses a graph algorithm to automatically extract and recombine high-level concepts across multiple documents.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 11:07:12 GMT)
A report on scaling laws for synthetic data. Building on SynthLLM, a framework for generating high-quality data, it reports "Key findings from our extensive mathematical experiments on SYNTHLLM include: (1) SYNTHLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens.", a conclusion that suggests synthetic data is effective.
The project site is Advancing AI for Humanity.

Establishing Task Scaling Laws via Compute-Efficient Model Ladders

Establishing Task Scaling Laws via Compute-Efficient Model Ladders [123.8]
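We develop task scaling laws and model ladders to predict the task performance of pretrained language models (LMs). First, model and data size are used to predict a task-specific loss, and then that task loss is used to predict task performance.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 18:21:49 GMT)
A proposal for predicting task performance efficiently. The authors state: "With a less than 1% of the pretraining compute, we are able to predict the task performance of 7B-4T and 13B-5T models on individual multiple-choice tasks with good accuracy."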
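
The two-step prediction described above (model and data size to task loss, then task loss to task performance) can be sketched as follows. The functional forms here, a sum of power laws for the loss and a sigmoid for the accuracy, are assumptions chosen for illustration, since this post does not give the formulas. All parameter values are hypothetical.

```python
import math


def predict_task_loss(n_params, n_tokens, A, alpha, B, beta, E):
    """Step 1 (assumed form): task loss from model size N and data size D."""
    return A / n_params ** alpha + B / n_tokens ** beta + E


def predict_task_accuracy(task_loss, a, b, k, loss_mid):
    """Step 2 (assumed form): sigmoid mapping from task loss to accuracy."""
    return a / (1.0 + math.exp(-k * (task_loss - loss_mid))) + b


# Hypothetical fitted parameters; in practice they would be fitted on
# small "ladder" models and then extrapolated to the target model.
LOSS_PARAMS = dict(A=400.0, alpha=0.3, B=600.0, beta=0.3, E=0.5)
ACC_PARAMS = dict(a=-0.75, b=1.0, k=3.0, loss_mid=1.5)  # a < 0: lower loss, higher accuracy

small_loss = predict_task_loss(1e9, 1e11, **LOSS_PARAMS)
large_loss = predict_task_loss(7e9, 4e12, **LOSS_PARAMS)
small_acc = predict_task_accuracy(small_loss, **ACC_PARAMS)
large_acc = predict_task_accuracy(large_loss, **ACC_PARAMS)
print(f"1B-100B: loss={small_loss:.3f}, acc={small_acc:.3f}")
print(f"7B-4T:   loss={large_loss:.3f}, acc={large_acc:.3f}")
```

Chaining the two steps lets task loss act as a smooth intermediate quantity, which is typically easier to extrapolate than accuracy itself.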

Performance Law of Large Language Models

Performance Law of Large Language Models [58.3]
The performance law can be used to guide the choice of LLM architecture and the efficient allocation of computational resources without extensive experiments.
Reference translation of the paper (metadata) (Mon, 19 Aug 2024 11:09:12 GMT)
MMLU scores are predicted directly from a formula. The key inputs are "• The number of layers N • The hidden size h • The intermediate size d of FFN • The size of training data T (trillion tokens) • The model size S (billion parameters)".
Interesting, but is it really true?

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling [81.3]
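We explore inference compute as another axis for scaling by increasing the number of generated samples. In domains such as coding and formal proofs, where every answer can be verified automatically, increased coverage translates directly into better performance. Identifying correct samples among many generated ones is an important direction for future research in domains without automatic verification.
Reference translation of the paper (metadata) (Wed, 31 Jul 2024 17:57:25 GMT)
In terms of compute, there is also the topic of scaling on the inference side.
(I am also curious how this relates to high-quality synthetic data.)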
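
Coverage under repeated sampling is commonly measured as pass@k, the probability that at least one of k samples is correct. Below is a minimal sketch of the standard unbiased pass@k estimator from the code-generation literature. That this matches the paper's exact evaluation procedure is an assumption, and the sample counts are hypothetical.

```python
from math import comb


def pass_at_k(n_samples, n_correct, k):
    """Unbiased estimate of P(at least one of k samples is correct).

    Computed as 1 - C(n - c, k) / C(n, k), where n samples were drawn
    for a problem and c of them passed the automatic verifier.
    """
    if not 0 <= n_correct <= n_samples:
        raise ValueError("n_correct must be between 0 and n_samples")
    if not 1 <= k <= n_samples:
        raise ValueError("k must be between 1 and n_samples")
    # comb() returns 0 when fewer than k incorrect samples exist,
    # so the estimate is exactly 1.0 in that case.
    return 1.0 - comb(n_samples - n_correct, k) / comb(n_samples, k)


# Hypothetical problem: 100 samples drawn, 3 verified correct.
for k in (1, 10, 100):
    print(k, round(pass_at_k(100, 3, k), 4))
```

The estimate grows with k, which is the "coverage" effect: with an automatic verifier, a single correct sample among many is enough to solve the problem.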

Scaling Laws of Synthetic Images for Model Training

Scaling Laws of Synthetic Images for Model Training … for Now [54.4]
This study examines scaling laws for synthetic images generated by state-of-the-art text-to-image models. For CLIP training, synthetic images show a scaling trend that is similar to that of real images but slightly less effective.
Reference translation of the paper (metadata) (Thu, 7 Dec 2023 18:59:59 GMT)
A verification of scaling laws when training on synthetic data. Using synthetic data is a promising approach, but much about it remains unclear, so a large-scale study is welcome. The statement "In supervised settings, synthetic data does not scale as effectively as real data." is about what one would expect, but the findings "However, our study also highlights several scenarios where synthetic data proves advantageous: (1) In certain classes, synthetic data demonstrates better scaling behavior compared to real data; (2) Synthetic data is particularly effective when real data is scarce, for instance, in CLIP training with limited datasets; (3) Models trained on synthetic data may exhibit superior generalization to out-of-distribution data." are important.
The repository is GitHub – google-research/syn-rep-learn: Learning from synthetic data – code and models

Inverse Scaling

Inverse Scaling: When Bigger Isn’t Better [65.0]
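Large language models (LMs) show predictable improvements in overall loss with increased scale. We present evidence for the claim that LMs may show inverse scaling, that is, worse task performance with increased scale.
Reference translation of the paper (metadata) (Thu, 15 Jun 2023 20:11:23 GMT)
Examples and analysis of tasks on which, contrary to the usual trend, scores get worse as the training FLOPs of large language models (which also correlate with the number of model parameters) increase, together with an analysis of the Inverse Scaling Prize (§2).
It is interesting that some tasks produce not only U-shaped but also inverted-U-shaped curves.
The repository is GitHub – inverse-scaling/prize: A prize for finding tasks that cause large language models to show inverse scaling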

Scaling Laws in Machine Translation

Scaling Laws for Multilingual Neural Machine Translation [45.6]
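We study how increasing model size affects model performance and investigate the role of the training mixture composition in scaling behavior. We find that changing the weightings of the individual language pairs in the training mixture affects only the multiplicative factor of the scaling law. We leverage these observations to predict the performance of multilingual models trained with any language weighting.
Reference translation of the paper (metadata) (Sun, 19 Feb 2023 18:43:24 GMT)
Results of verifying scaling laws for multilingual machine translation. There are many interesting results. Given the common view that multilingual translation across closely related languages brings large gains, it is especially interesting that no significant difference was found between the scaling behavior of models trained to translate into related languages (En→{De, Fr}) and models trained on unrelated languages (En→{De, Zh}).
Doing this kind of work as an individual, as with staka/takomt · Hugging Face, is quite hard, so for now I plan to focus on JA⇔EN, but this is a very interesting paper.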
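
The finding that language-pair weighting affects only the multiplicative factor can be illustrated with a small sketch. It assumes a scaling law of the form L(N) = beta * N^(-alpha) + L_inf for each language pair, where only beta depends on the mixture weight. The functional form as written here, the function name, and all numbers are assumptions for illustration, not values from the paper.

```python
def language_pair_loss(n_params, beta, alpha, irreducible):
    """Assumed per-language-pair law: beta * N**(-alpha) + irreducible loss."""
    return beta * n_params ** (-alpha) + irreducible


ALPHA = 0.25   # hypothetical exponent, shared across mixture weightings
L_INF = 1.0    # hypothetical irreducible loss, shared across weightings
BETA_BY_WEIGHT = {0.3: 120.0, 0.7: 90.0}  # hypothetical multiplicative factors

model_sizes = [1e8, 1e9, 1e10]
# Ratio of reducible losses between the two weightings at each model size.
ratios = [
    (language_pair_loss(n, BETA_BY_WEIGHT[0.3], ALPHA, L_INF) - L_INF)
    / (language_pair_loss(n, BETA_BY_WEIGHT[0.7], ALPHA, L_INF) - L_INF)
    for n in model_sizes
]
print([round(r, 4) for r in ratios])
```

Under this assumption the ratio of reducible losses is the same at every model size, so a weighting's effect measured on small models carries over to large ones. That property is what makes it possible to predict performance for a new weighting.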

Scaling Laws for Generative Mixed-Modal Language Models

Scaling Laws for Generative Mixed-Modal Language Models [103.3]
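We report mixed-modal scaling laws that unify the contributions of individual modalities and the interactions between them. Specifically, we explicitly model the optimal synergy and competition due to data and model size as an additive term to previous uni-modal scaling laws. We also find four empirical phenomena observed during training, such as emergent coordinate-ascent style training that naturally alternates between modalities.
Reference translation of the paper (metadata) (Tue, 10 Jan 2023 00:20:06 GMT)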

Scaling Laws vs Model Architectures

Scaling Laws vs Model Architectures: How does Inductive Bias Influence Scaling? [91.8]
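This paper conducts a systematic study of the scaling behavior of ten different model architectures. We show that architecture is an important consideration when scaling and that the best-performing model can vary across scales.
Reference translation of the paper (metadata) (Thu, 21 Jul 2022 15:50:22 GMT)
- A paper examining whether scaling behavior changes with architecture. The large-scale experiments make it a very useful reference. As intuition suggests, "architecture is an important consideration when scaling."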
