Synthetic data – Page 4 – Introducing the Latest arXiv Papers

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset [33.2]
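This work shows how to improve the trade-off between accuracy and data quantity. An 8B-parameter model trained for 15T tokens, of which 7.2T came from this dataset, outperforms the Llama 3.1 8B model.
Reference translation of the paper (metadata) (Tue, 03 Dec 2024 17:28:50 GMT)
Like RedStone, this is a report on a method for refining Common Crawl effectively, this time from NVIDIA. It is also interesting that synthetic data is mentioned: "We propose a method for transforming English Common Crawl into a 6.3T token long-horizon pretraining dataset, consisting of 4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens."
The project site is Nemotron-CC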

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.4]
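This paper proposes a two-player paradigm that separates the roles of the reasoning model and the critique model. It first proposes AutoMathCritique, an automated and scalable framework for collecting critique data. The critique model is shown to consistently improve the actor's performance on difficult queries at test time.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 17:11:54 GMT)
The data is built with AutoMathCritique, a framework with three stages ("flawed reasoning path construction, critique generation, and data filtering"), and then used for fine-tuning. The authors also apply the following and confirm its effect: "Motivated by the insights of test-time, we introduce the critique model into the actor model’s exploration and learning process, introducing a critique-in-the-loop self-improvement method". The results appear to show that the critique model is effective (though building one may not be easy).
The repository is AutoMathCritique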
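
The three stages quoted above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `construct_flawed_path`, `generate_critique`, `filter_examples`, and `toy_critic` are all hypothetical, the critic is a hard-coded stand-in for an LLM call, and the filter simply checks that the critique points at the step where the flaw was injected, which is a simplification chosen for this sketch.

```python
from dataclasses import dataclass, replace
from typing import Callable, List


@dataclass
class CritiqueExample:
    question: str
    steps: List[str]   # reasoning path containing one injected flaw
    flawed_step: int   # 0-based index of the flaw, known by construction
    critique: str = ""


def construct_flawed_path(question, correct_steps, flaw_index, flawed_text):
    """Stage 1: build a reasoning path with a known error at flaw_index."""
    steps = list(correct_steps)
    steps[flaw_index] = flawed_text
    return CritiqueExample(question, steps, flaw_index)


def generate_critique(example, critic: Callable[[str, List[str]], str]):
    """Stage 2: ask a critic (any callable) for step-level feedback."""
    return replace(example, critique=critic(example.question, example.steps))


def filter_examples(examples):
    """Stage 3: keep only critiques that name the step known to be flawed."""
    kept = []
    for ex in examples:
        marker = f"step {ex.flawed_step + 1}"
        if marker in ex.critique.lower():
            kept.append(ex)
    return kept


def toy_critic(question, steps):
    # Stand-in for an LLM call (assumption): flags the first step ending in "= 14".
    for i, step in enumerate(steps, start=1):
        if step.endswith("= 14"):
            return f"Step {i} is incorrect: 3 * 4 is 12, not 14."
    return "All steps look correct."


correct = ["3 * 4 = 12", "5 + 12 = 17"]
example = construct_flawed_path("What is 5 + 3 * 4?", correct, 0, "3 * 4 = 14")
good = generate_critique(example, toy_critic)
lazy = generate_critique(example, lambda q, s: "All steps look correct.")
kept = filter_examples([good, lazy])
print(len(kept), kept[0].critique)
```

Because the flaw position is known at construction time in this sketch, the filter can be automatic; a real pipeline would need a more robust check than substring matching.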

Training and Evaluating Language Models with Template-based Data Generation

Training and Evaluating Language Models with Template-based Data Generation [6.0]
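We create a dataset of more than 7 million synthesized grade-school math problems. The dataset serves as a valuable resource for pretraining, fine-tuning, and evaluating LLMs on mathematical reasoning.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 07:32:56 GMT)
Synthetic data construction in which the LLM is entrusted with everything, starting from the creation of the meta-templates. Interesting, but I wonder whether it could work in other fields as well.
The repository is GitHub – iiis-ai/TemplateMath: Official implementation of “Training and Evaluating Language Models with Template-based Data Generation” (https://templatemath.github.io)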
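
To make the idea of template-based generation concrete, here is a minimal sketch. It assumes a single hand-written template, whereas the paper has an LLM author the meta-templates; `TEMPLATE`, `NAMES`, `ITEMS`, `generate_problem`, and `generate_dataset` are hypothetical names introduced only for illustration. The key property is that the answer is computed from the same sampled values that fill the question text, so every generated problem comes with a verified label.

```python
import random

# Hypothetical hand-written template (assumption); not taken from the paper.
TEMPLATE = (
    "{name} has {start} {item}. {name} buys {boxes} boxes with {per_box} {item} "
    "in each box. How many {item} does {name} have now?"
)
NAMES = ["Aiko", "Ben", "Chloe"]
ITEMS = ["apples", "pencils", "marbles"]


def generate_problem(rng):
    """Fill the template with random values and compute the answer from them."""
    values = {
        "name": rng.choice(NAMES),
        "item": rng.choice(ITEMS),
        "start": rng.randint(1, 20),
        "boxes": rng.randint(2, 9),
        "per_box": rng.randint(2, 12),
    }
    answer = values["start"] + values["boxes"] * values["per_box"]
    return {"question": TEMPLATE.format(**values), "answer": answer, "values": values}


def generate_dataset(n, seed=0):
    """Generate n problems reproducibly from a seeded random generator."""
    rng = random.Random(seed)
    return [generate_problem(rng) for _ in range(n)]


for problem in generate_dataset(3):
    print(problem["question"], "->", problem["answer"])
```

Scaling this up is a matter of adding templates and value ranges, which is presumably where delegating template authoring to an LLM pays off.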

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch [28.5]
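ScaleQuest is a scalable and novel data synthesis method. It generates questions from scratch, without requiring seed data with complex augmentation constraints. It can universally improve the performance of major open-source models.
Reference translation of the paper (metadata) (Thu, 24 Oct 2024 12:42:04 GMT)
A proposal for a framework that strengthens model performance via synthetic data, an approach that is presumably already widely used for commercial models. The following statement is also interesting: "Our experiments demonstrate the model’s self-improvement capability, meaning that it can generate data of higher quality than its original training set."
The repository is GitHub – yyDing1/ScaleQuest: We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.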

A Survey on Data Synthesis and Augmentation for Large Language Models

A Survey on Data Synthesis and Augmentation for Large Language Models [35.6]
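This paper reviews and summarizes data generation techniques across the lifecycle of large language models. It discusses the current limitations these methods face and considers directions for future development and research.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 16:12:39 GMT)
A survey of data synthesis for LLMs, a topic of growing importance.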

DocLayout-YOLO

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.3]
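We introduce DocLayout-YOLO, a new approach that improves accuracy while retaining its speed advantage. For robust document pretraining, we introduce the Mesh-candidate BestFit algorithm. In terms of model optimization, we propose a global-to-local controllable receptive module.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 14:50:47 GMT)
Proposes Mesh-candidate BestFit, a methodology for synthesizing diverse layout data, along with DocLayout-YOLO, a fast and high-performing model that uses it.
The repository is GitHub – opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception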

GenSim2

GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs [38.3]
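GenSim2 is a scalable framework for creating complex and realistic simulation tasks. The pipeline can generate data for up to 100 articulated tasks with 200 objects, reducing the human effort required. We show a promising use of GenSim2: the generated data can be used for zero-shot transfer or for co-training with data collected in the real world.
Reference translation of the paper (metadata) (Fri, 04 Oct 2024 17:51:33 GMT)
Proposes a framework consisting of (1) task proposal, (2) solver creation, (3) multi-task training, and (4) generalization evaluation and sim-to-real transfer. The approach synthesizes data while making use of LLMs and MLLMs at each stage. (This is not the NLP library gensim.)
The project site is GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs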

GenSim: A General Social Simulation Platform with Large Language Model based Agents [110.4]
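We propose a new large language model (LLM) based simulation platform called GenSim. Our platform supports 100,000 agents and simulates large populations in real-world contexts. To the best of our knowledge, GenSim is a first step toward a general, large-scale, and correctable social simulation platform.
Reference translation of the paper (metadata) (Sun, 06 Oct 2024 05:02:23 GMT)
A large-scale simulation platform for LLM-based agents (this, too, is not the NLP library gensim).
The repository is GitHub – TangJiakai/GenSim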

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning [78.4]
Reflective Monte Carlo Tree Search (R-MCTS) is a new test-time algorithm designed to enhance the capabilities of AI agents. R-MCTS extends traditional MCTS by incorporating contrastive reflection, which allows agents to learn from past interactions. Fine-tuning GPT-4o through self-learning further improves agent performance.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 21:42:35 GMT)
Proposes a form of Monte Carlo tree search described as "We propose Reflective Monte Carlo Tree Search (R-MCTS), an extension of classic MCTS that improves the agent’s decision making process on the fly by incorporating reflection over its past task executions, and state estimations using multi-agent-debate", and improves benchmark results through SFT based on it. The results are better than ToT and plain MCTS.
The repository is jasonyux/RMCTS-self-learning · GitHub

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale [97.2]
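LLMs can serve as autonomous agents that interact with digital environments and complete specific objectives. Their accuracy is still insufficient, partly because of the lack of large-scale direct demonstrations for digital tasks. We propose Synatra, an approach that turns this indirect knowledge into direct supervision at scale.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 00:51:45 GMT)
Proposes an approach that synthesizes the actions an agent should take for complex tasks. The gist seems to be that using an LLM to fill in vague passages in manuals and similar material, such as "enter a keyword", contributes to better performance. It conveys the limits of agents (how they differ from humans) as well as the effectiveness of synthetic data and the power of LLMs.
Effectiveness is confirmed: "We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks Mind2Web, MiniWoB++ and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web." The cost performance is also excellent: "In addition, while synthetic demonstrations prove to be only 3% the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains."
The repository is Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale (oootttyyy.github.io)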
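
As a quick back-of-the-envelope check of the cost figures quoted above, the arithmetic can be worked through as below. The three constants come from the quoted abstract; the per-demonstration human cost is not stated in the source and is only inferred here from the rounded "3%" figure, so treat it as approximate.

```python
from decimal import Decimal

COST_PER_SYNTHETIC_DEMO = Decimal("0.031")  # USD, from the quoted abstract
NUM_DEMOS = 100_000                         # "100k", from the quoted abstract
SYNTHETIC_TO_HUMAN_RATIO = Decimal("0.03")  # "only 3% the cost", from the quoted abstract

# Total cost of the synthetic demonstrations used for fine-tuning.
total_synthetic_cost = COST_PER_SYNTHETIC_DEMO * NUM_DEMOS

# Inferred, not stated in the source: human cost per demonstration implied by the 3% ratio.
implied_human_cost_per_demo = COST_PER_SYNTHETIC_DEMO / SYNTHETIC_TO_HUMAN_RATIO
implied_human_total = implied_human_cost_per_demo * NUM_DEMOS

print(f"synthetic total:      ${total_synthetic_cost:,.2f}")
print(f"implied human / demo: ${implied_human_cost_per_demo:,.2f}")
print(f"implied human total:  ${implied_human_total:,.2f}")
```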

Qwen 2.5, Qwen 2 VL, GRIN-MoE, Pixtral

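Various research organizations are building LLMs. Among last week's news, the ones to watch are the high-performing LLM Qwen 2.5, the efficient MoE-based GRIN-MoE, and the multimodal extensions Qwen 2 VL and Pixtral.

Note that the licenses vary, but the models themselves are publicly available, so the options beyond commercial APIs are expanding. Each model also has its own aims, and frankly, even evaluating them is not easy. These days I find myself wanting an AI that suggests the base model and usage that fit what I want to do.

A great deal of information has also been published from the standpoint of model building and fine-tuning, which is very interesting.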

Qwen2.5-Coder Technical Report [100.7]
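We introduce the Qwen2.5-Coder series, a significant upgrade from its predecessor, CodeQwen1.5. As a code-specific model, Qwen2.5-Coder is built on the Qwen2.5 architecture and pretrained on a vast corpus of over 5.5 trillion tokens.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:57:57 GMT)
The report stresses the importance of filtering: "To ensure the quality of the pre-training data, we have curated a dataset by collecting public code data and extracting high-quality code-related content from web texts, while filtering out low-quality data using advanced classifiers." Data synthesis is also touched on, but perhaps the emphasis on filtering is because, unlike the math case, real data is abundant?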

Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement [71.5]
This report presents Qwen2.5-Math and Qwen2.5-Math-Instruct-1.5B/7B/72B. Qwen2.5-Math-Instruct supports both Chinese and English and has advanced mathematical reasoning capabilities.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 16:45:37 GMT)
The effect of synthetic data and of self-improvement-style mechanisms is interesting: "In this report, we introduce Qwen2.5-Math, which features several key technical highlights: (1) extensive use of synthesized mathematical data from Qwen2-Math during the pre-training phase, (2) iterative generation of fine-tuning data and reinforcement training guided by the reward model during the post-training and inference phase and (3) support for bilingual (English and Chinese) queries, along with chain-of-thought and tool-integrated reasoning capabilities."

Qwen2-VL: Enhancing Vision-Language Model’s Perception of the World at Any Resolution [82.4]
This paper introduces the Qwen2-VL series, an upgrade of the previous Qwen-VL models. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which allows images of varying resolutions to be processed into different numbers of visual tokens. It also integrates Multimodal Rotary Position Embedding (M-RoPE), which facilitates the effective fusion of positional information across text, images, and videos.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:59:32 GMT)
A powerful multimodal model with video support and Japanese support: "Qwen2-VL series introduces naive dynamic resolution and multimodal rotary position embedding (M-RoPE) to fuse information across modals effectively and be capable of understanding videos over 20 minutes in length." and "Furthermore, Qwen2-VL now supports understanding multilingual texts within images, including most European languages, Japanese, Korean, Arabic, Vietnamese, and others."
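
To illustrate what "different numbers of visual tokens" for different resolutions means in practice, here is a small sketch of the token count as a function of image size. The specific numbers are assumptions that are not stated in this post: a ViT patch size of 14, a 2 × 2 merge of adjacent patch tokens, two special tokens marking the start and end of the image, and rounding each side up to a multiple of 28 pixels. `visual_token_count` is a hypothetical helper, and the technical report should be consulted for the actual preprocessing.

```python
import math


def visual_token_count(height, width, patch=14, merge=2, special_tokens=2):
    """Estimate how many tokens an image contributes under the stated assumptions."""
    unit = patch * merge  # each merged token covers a unit x unit pixel area
    # Assumption: pad each side up to a multiple of `unit` so patches tile evenly.
    h = math.ceil(height / unit) * unit
    w = math.ceil(width / unit) * unit
    patches = (h // patch) * (w // patch)
    return patches // (merge * merge) + special_tokens


for size in [(224, 224), (448, 224), (1024, 768)]:
    print(size, visual_token_count(*size))
```

The point of the sketch is only that the token budget grows with resolution instead of every image being resized to one fixed grid.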

GRIN: GRadient-INformed MoE [132.9]
Mixture-of-Experts (MoE) models scale more effectively than dense models thanks to sparse computation through expert routing. We introduce GRIN (GRadient-INformed MoE training), which incorporates sparse gradient estimation for expert routing. Our model has only 6.6B activated parameters, yet it outperforms a 7B dense model and matches the performance of a 14B dense model trained on the same data.
Reference translation of the paper (metadata) (Wed, 18 Sep 2024 17:00:20 GMT)
An improvement to MoE characterized by "We propose SparseMixer-v2 to estimate the gradient related to expert routing, while the conventional MoE training treats expert gating as a proxy for the gradient estimation." and "We scale MoE training with neither expert parallelism nor token dropping, while the conventional MoE training employs expert parallelism and deploys token dropping."
I recall reading a report that experts in MoE configurations do not specialize as much as one might expect, so the statement "Our study seems to verify our hypothesis that expert networks in GRIN MoE have developed highly-specialized and heterogeneous expertise." is interesting.
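
For readers unfamiliar with expert routing, the following is a minimal sketch of conventional top-k gating in plain Python. It is not GRIN's SparseMixer-v2; it only shows the sparse forward pass whose discrete expert selection is what makes gradient estimation for the router non-trivial. The function names, the choice of k = 2, and the scalar "experts" are all assumptions made for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight)] for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)), key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])  # renormalize over the selected experts
    return list(zip(top, weights))


def moe_forward(x, router_logits, experts, k=2):
    """Sparse forward pass: only the k selected experts are evaluated."""
    return sum(w * experts[i](x) for i, w in top_k_route(router_logits, k))


# Toy experts operating on a scalar input (hypothetical stand-ins for FFN blocks).
experts = [lambda x: x, lambda x: 2 * x, lambda x: 3 * x, lambda x: -x]
logits = [0.1, 2.0, 2.0, -1.0]
print(top_k_route(logits))
print(moe_forward(1.0, logits, experts))
```

The selection of the top-k indices is a discrete operation, so no gradient flows through the choice itself; only the weights of the chosen experts are differentiable in this conventional setup.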

Pixtral 12B [56.8]
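We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents. Unlike many open-source models, Pixtral is also a state-of-the-art text model for its size.
Reference translation of the paper (metadata) (Wed, 09 Oct 2024 17:16:22 GMT)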
Announcing Pixtral 12B | Mistral AI | Frontier AI in your hands
GitHub – mistralai/mistral-evals
