Synthetic data – Page 5 – Introducing the Latest arXiv Papers

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models [12.9]
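The paper evaluates algorithms by the composition of the synthetic data each one generates, from the perspectives of data quality, diversity, and complexity. It examines how the various components of a synthetic data pipeline affect each of these data characteristics. The authors argue that balancing these trade-offs is essential to the development of future self-improvement algorithms.
Reference translation of the paper (metadata) (Wed, 04 Dec 2024 02:47:45 GMT)
A survey of synthetic data from the perspectives of Quality, Diversity, and Complexity. The statement "Overall, we found that domain specific, attribute measures utilizing LLMs-as-a-judge provide the best measures in complex tasks and domains in terms of correlation with downstream metrics." is an interesting point.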

Phi4, InternVL 2.5, EXAONE 3.5

Gemini 2.0 and OpenAI's 12 days of announcements are drawing most of the excitement, but a variety of OSS and openly released models have been announced as well.

Phi-4 Technical Report [72.1]
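This work presents phi-4, a 14-billion-parameter language model developed with a focus on data quality. Unlike most language models, whose pre-training is based primarily on organic data sources such as web content and code, phi-4 strategically incorporates synthetic data throughout the training process.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 03:37:41 GMT)
The latest version of Phi, a small, high-performance model. As stated in "phi-4 strategically incorporates synthetic data throughout the training process.", the approach makes good use of synthetic data. An excellent model that surpasses Phi3 and approaches GPT-4o mini.
There is also an announcement on the official blog: Introducing Phi-4: Microsoft’s Newest Small Language Model Specializing in Complex Reasoning | Microsoft Community Hub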

EXAONE 3.5: Series of Large Language Models for Real-world Use Cases [35.0]
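The EXAONE 3.5 language models are offered in three configurations: 32B, 7.8B, and 2.4B. For commercial use, please refer to the official contact point of LG AI Research.
Reference translation of the paper (metadata) (Mon, 09 Dec 2024 09:31:10 GMT)
A publicly released model from LG, with performance competitive with Qwen2.5 models of the same size.
The repository is LGAI-EXAONE (LG AI Research)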

Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling [121.1]
InternVL 2.5 is an advanced multimodal large language model (MLLM) series built on InternVL 2.0. InternVL 2.5 is competitive with leading commercial models such as GPT-4o and Claude-3.5-Sonnet. The authors hope that this model will contribute to the open-source community by setting new standards for developing and applying multimodal AI systems.
Reference translation of the paper (metadata) (Fri, 06 Dec 2024 18:57:08 GMT)
An OSS MLLM whose performance is said to be competitive with commercial models. The architecture, described as "we integrate a newly incrementally pre-trained InternViT with various pre-trained LLMs, including InternLM 2.5 and Qwen 2.5, using a randomly initialized MLP projector.", connects a ViT to an LLM through a projector.
The repositories are OpenGVLab/InternVL2_5-78B · Hugging Face and GitHub – OpenGVLab/InternVL: [CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. 接近GPT-4o表现的开源多模态对话模型
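The "ViT, then projector, then LLM" wiring quoted above can be illustrated with a minimal pure-Python sketch. This is not InternVL's implementation: the two-layer depth, the GELU activation, the toy widths `VIT_DIM` and `LLM_DIM`, and all function names are assumptions chosen for illustration. The only point carried over from the quote is that the projector is a randomly initialized MLP that maps vision tokens to the width the LLM expects.

```python
import math
import random

def make_linear(in_dim, out_dim, rng):
    """Randomly initialized weight matrix (out_dim x in_dim) with a zero bias."""
    scale = 1.0 / math.sqrt(in_dim)
    weight = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
              for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weight, bias

def linear(x, layer):
    """Apply y = W x + b to a single vector."""
    weight, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def gelu(v):
    """Tanh approximation of GELU (the activation choice is an assumption)."""
    return 0.5 * v * (1.0 + math.tanh(
        math.sqrt(2.0 / math.pi) * (v + 0.044715 * v ** 3)))

def project(vision_tokens, layer1, layer2):
    """Map every vision token to the LLM embedding width with a 2-layer MLP."""
    return [linear([gelu(h) for h in linear(tok, layer1)], layer2)
            for tok in vision_tokens]

rng = random.Random(0)
VIT_DIM, LLM_DIM = 8, 16  # hypothetical toy widths, far smaller than real models
layer1 = make_linear(VIT_DIM, LLM_DIM, rng)
layer2 = make_linear(LLM_DIM, LLM_DIM, rng)

# Four fake "vision tokens" standing in for the vision encoder's output.
vision_tokens = [[rng.gauss(0, 1) for _ in range(VIT_DIM)] for _ in range(4)]
llm_inputs = project(vision_tokens, layer1, layer2)
print(len(llm_inputs), len(llm_inputs[0]))  # 4 16
```

The projected vectors have the same width as the LLM's token embeddings, so in a design of this kind they can be placed in the input sequence alongside text embeddings.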

InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions [104.9]
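This work introduces disentangled streaming perception, reasoning, and memory mechanisms to enable real-time interaction with streaming video and audio input. The project simulates human-like cognition so that multimodal large language models can provide continuous and adaptive service over time.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:58:30 GMT)
A framework that provides not only real-time streaming but also features such as memory.
The repository is GitHub – InternLM/InternLM-XComposer: InternLM-XComposer2.5-OmniLive: A Comprehensive Multimodal System for Long-term Streaming Video and Audio Interactions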

Owl-1: Omni World Model for Consistent Long Video Generation [75.5]
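The paper proposes the Omni World ModeL (Owl-1). Owl-1 achieves performance comparable to SOTA methods on VBench-I2V and VBench-Long.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 18:59:01 GMT)
A video generation model. The repository is GitHub – huang-yh/Owl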

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset

Nemotron-CC: Transforming Common Crawl into a Refined Long-Horizon Pretraining Dataset [33.2]
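The paper shows how to improve the trade-off between accuracy and data quantity. An 8B-parameter model trained for 15T tokens, of which 7.2T came from this dataset, outperforms the Llama 3.1 8B model.
Reference translation of the paper (metadata) (Tue, 03 Dec 2024 17:28:50 GMT)
Like RedStone, this is a report on a method for refining Common Crawl effectively. This one is from NVIDIA. It is also interesting that synthetic data is mentioned: "We propose a method for transforming English Common Crawl into a 6.3T token long-horizon pretraining dataset, consisting of 4.4T globally deduplicated original tokens and 1.9T synthetically generated tokens."
The project site is Nemotron-CC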
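As a quick check on the composition quoted above, the two token counts can be added up and the synthetic share computed. The figures are taken directly from the quoted sentence; the variable names are only illustrative.

```python
# Token counts in trillions, as stated in the quoted abstract sentence.
original_tokens_t = 4.4   # globally deduplicated original tokens
synthetic_tokens_t = 1.9  # synthetically generated tokens

total_t = original_tokens_t + synthetic_tokens_t
synthetic_share = synthetic_tokens_t / total_t

print(f"total: {total_t:.1f}T, synthetic share: {synthetic_share:.1%}")
# total: 6.3T, synthetic share: 30.2%
```

So roughly three in ten tokens of the 6.3T dataset are synthetic, which is why the entry fits under the synthetic data tag.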

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision

Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision [120.4]
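This paper proposes a two-player paradigm that separates the roles of the reasoning model and the critique model. It first proposes AutoMathCritique, an automated and scalable framework for collecting critique data. The critique model is shown to consistently improve the actor's performance on difficult queries at test time.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 17:11:54 GMT)
Data is built with AutoMathCritique, a framework consisting of three stages, "flawed reasoning path construction, critique generation, and data filtering", and used for fine-tuning. The authors also apply "Motivated by the insights of test-time, we introduce the critique model into the actor model’s exploration and learning process, introducing a critique-in-the-loop self-improvement method" and confirm its effect. The results appear to show the effectiveness of the critique model (although building one may not be easy).
The repository is AutoMathCritique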
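The two-player idea at test time (an actor answers, a critic reviews, the actor revises) can be sketched as a small loop. Everything below is a toy under stated assumptions: `toy_actor` and `toy_critic` are hypothetical stand-ins for LLM calls, the addition task is made up, and the toy critic simply recomputes the answer, which a real critique model cannot do. It is not the paper's implementation.

```python
from typing import Optional, Tuple

Question = Tuple[int, int]  # toy task: add two integers

def toy_actor(question: Question, feedback: Optional[str]) -> int:
    """Stand-in for the reasoning model; its first attempt is deliberately flawed."""
    a, b = question
    if feedback is None:
        return a + b + 1  # flawed first attempt
    return a + b          # revised answer once a critique is available

def toy_critic(question: Question, answer: int) -> Optional[str]:
    """Stand-in for the critique model: returns feedback, or None if no flaw is found."""
    a, b = question
    if answer != a + b:
        return f"The sum of {a} and {b} is not {answer}; recheck the addition."
    return None

def solve_with_critique(question: Question, max_rounds: int = 3) -> Tuple[int, int]:
    """Return (final answer, number of revisions made) for one question."""
    answer = toy_actor(question, None)
    for revisions in range(max_rounds):
        feedback = toy_critic(question, answer)
        if feedback is None:       # critic accepts the answer
            return answer, revisions
        answer = toy_actor(question, feedback)
    return answer, max_rounds      # budget exhausted; answer may still be flawed

print(solve_with_critique((2, 3)))  # (5, 1)
```

Keeping the two roles as separate functions mirrors the separation the paper describes, so either side could be swapped for a different model without touching the loop.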

Training and Evaluating Language Models with Template-based Data Generation

Training and Evaluating Language Models with Template-based Data Generation [6.0]
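The authors create a dataset of more than 7 million synthesized grade-school math problems. The dataset serves as a valuable resource for pre-training, fine-tuning, and evaluating LLMs on mathematical reasoning.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 07:32:56 GMT)
Synthetic data construction in which everything, starting from the creation of meta-templates, is left to an LLM. Interesting, but I wonder whether it could also work in other fields.
The repository is GitHub – iiis-ai/TemplateMath: Official implementation of “Training and Evaluating Language Models with Template-based Data Generation” (https://templatemath.github.io)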
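A minimal sketch of template-based generation helps show why the approach scales: each template pairs a question pattern with a function that computes the answer, so every sampled instance comes with a verified label. The templates, names, and value ranges below are hand-written and hypothetical; in the paper the meta-templates are produced by an LLM, which this sketch does not attempt to reproduce.

```python
import random

# Each entry: (question pattern, function computing the ground-truth answer).
TEMPLATES = [
    ("{name} has {a} apples and buys {b} more. "
     "How many apples does {name} have now?",
     lambda a, b: a + b),
    ("{name} has {a} stickers and gives away {b}. "
     "How many stickers are left?",
     lambda a, b: a - b),
]
NAMES = ["Aiko", "Ben", "Chen"]

def generate_problem(rng: random.Random) -> dict:
    """Sample one template and fill it with random values."""
    template, solve = rng.choice(TEMPLATES)
    b = rng.randint(1, 20)
    a = rng.randint(b, 50)  # a >= b keeps subtraction answers non-negative
    name = rng.choice(NAMES)
    return {
        "question": template.format(name=name, a=a, b=b),
        "answer": solve(a, b),
    }

rng = random.Random(42)  # fixed seed so the dataset is reproducible
dataset = [generate_problem(rng) for _ in range(5)]
for item in dataset:
    print(item["question"], "->", item["answer"])
```

Because the answer is computed rather than generated, the same templates can serve both for training data and for evaluation, which matches the two uses named in the paper title.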

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch

Unleashing Reasoning Capability of LLMs via Scalable Question Synthesis from Scratch [28.5]
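ScaleQuest is a scalable and novel data synthesis method. It generates questions from scratch, without requiring seed data with complex augmentation constraints. It can universally improve the performance of mainstream open-source models.
Reference translation of the paper (metadata) (Thu, 24 Oct 2024 12:42:04 GMT)
A proposal for a framework that strengthens model performance through synthetic data, an approach that is presumably already widely used for commercial models. The statement "Our experiments demonstrate the model’s self-improvement capability, meaning that it can generate data of higher quality than its original training set." is also interesting.
The repository is GitHub – yyDing1/ScaleQuest: We introduce ScaleQuest, a scalable, novel and cost-effective data synthesis method to unleash the reasoning capability of LLMs.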

A Survey on Data Synthesis and Augmentation for Large Language Models

A Survey on Data Synthesis and Augmentation for Large Language Models [35.6]
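This paper reviews and summarizes data generation techniques across the lifecycle of large language models. It discusses the current constraints these methods face and considers paths for future development and research.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 16:12:39 GMT)
A survey of data synthesis for LLMs, a topic of growing importance.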

DocLayout-YOLO

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.3]
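The paper introduces DocLayout-YOLO, a new approach that improves accuracy while preserving the speed advantage. For robust document pre-training, it introduces the Mesh-candidate BestFit algorithm. On the model optimization side, it proposes a global-to-local receptive module.
Reference translation of the paper (metadata) (Wed, 16 Oct 2024 14:50:47 GMT)
Proposes the Mesh-candidate BestFit methodology, a technique for synthesizing diverse layout data, and DocLayout-YOLO, a fast, high-performance model built with it.
The repository is GitHub – opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception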

GenSim2

GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs [38.3]
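GenSim2 is a scalable framework for creating complex and realistic simulation tasks. The pipeline can generate data for up to 100 articulated tasks with 200 objects, reducing the human effort required. The authors show promising uses of GenSim2, in which the generated data can be used for zero-shot transfer or for co-training with data collected in the real world.
Reference translation of the paper (metadata) (Fri, 04 Oct 2024 17:51:33 GMT)
Proposes a framework consisting of (1) task proposal, (2) solver creation, (3) multi-task training, and (4) generalization evaluation and sim-to-real transfer. The approach synthesizes data while using LLMs and MLLMs at each step. (This is not the NLP library gensim.)
The project site is GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs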

GenSim: A General Social Simulation Platform with Large Language Model based Agents [110.4]
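The authors propose GenSim, a new simulation platform based on large language models (LLMs). The platform supports 100,000 agents and simulates large populations in real-world contexts. To the authors' knowledge, GenSim is a first step toward a general, large-scale, and correctable social simulation platform.
Reference translation of the paper (metadata) (Sun, 06 Oct 2024 05:02:23 GMT)
A large-scale simulation platform for LLM-based agents (this is not the NLP library gensim either).
The repository is GitHub – TangJiakai/GenSim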

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning [78.4]
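Reflective Monte Carlo Tree Search (R-MCTS) is a new test-time algorithm designed to enhance the capabilities of AI agents. R-MCTS extends traditional MCTS by incorporating contrastive reflection, which allows agents to learn from past interactions. The agent's performance is further improved by fine-tuning GPT-4o through self-learning.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 21:42:35 GMT)
Proposes a form of Monte Carlo tree search described as "We propose Reflective Monte Carlo Tree Search (R-MCTS), an extension of classic MCTS that improves the agent’s decision making process on the fly by incorporating reflection over its past task executions, and state estimations using multi-agent-debate", and improves benchmark results through SFT based on it. The results are better than ToT and plain MCTS.
The repository is jasonyux/RMCTS-self-learning · GitHub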
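For readers unfamiliar with the "classic MCTS" that R-MCTS extends, the sketch below shows the standard UCT selection rule used to pick which child node to explore next. It covers only that textbook step, not the reflection or multi-agent-debate components of R-MCTS. The function names, the node representation as `(value_sum, visits)` tuples, and the exploration constant are assumptions for illustration.

```python
import math

def uct_score(value_sum: float, visits: int, parent_visits: int,
              c: float = math.sqrt(2)) -> float:
    """Standard UCT: mean value plus an exploration bonus for rarely visited nodes."""
    if visits == 0:
        return math.inf  # unvisited children are always tried first
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore

def select_child(children: dict, parent_visits: int) -> str:
    """children maps an action name to (value_sum, visits); returns the best action."""
    return max(children,
               key=lambda name: uct_score(*children[name], parent_visits))

# Hypothetical statistics for three candidate actions under one parent node.
children = {"click": (3.0, 5), "type": (1.0, 1), "scroll": (0.0, 0)}
print(select_child(children, parent_visits=6))  # scroll
```

In this rule the exploration bonus shrinks as a child accumulates visits, so the search gradually shifts from trying untested actions to exploiting the ones with the highest average value.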
