Datasets – Introducing the Latest arXiv Papers

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning [24.7]
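We propose TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level textbooks. We also introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances. Experiments show that our datasets achieve superior performance and training efficiency with more concise response lengths.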
Reference translation of the paper (metadata) (Tue, 22 Jul 2025 17:59:03 GMT)
"We present TEXTBOOKREASONING and MEGASCIENCE, two datasets that advance the frontier in the scientific domain by enabling base models to outperform official instruct models on scientific tasks when fine-tuned with our data."
Repositories: GAIR-NLP/MegaScience: MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning, and MegaScience (MegaScience)

PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process

PPJudge: Towards Human-Aligned Assessment of Artistic Painting Process [15.4]
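This work proposes a new framework for human-aligned assessment of the painting process. Specifically, we introduce the Painting Process Assessment Dataset (PPAD), the first large-scale dataset of its kind, consisting of real and synthetic images. We also propose PPJudge, a Transformer-based model enhanced with temporally-aware positional encoding.
Reference translation of the paper (metadata) (Sat, 12 Jul 2025 10:30:44 GMT)
The paper proposes a dataset and a corresponding model: "we introduce a dataset specifically designed for painting process assessment: the Painting Process Assessment Dataset (PPAD). It consists of approximately 15,000 real paintings and 10,000 synthetic paintings, each annotated by domain experts."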

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [112.5]
We introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. Using the dataset, we developed a suite of models that generate motion gestures and facial expressions aligned with human speech.
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 18:09:49 GMT)
The dataset is described as follows: "we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools."
Repository: GitHub – facebookresearch/seamless_interaction: Foundation Models and Data for Human-Human and Human-AI interactions.

FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language

FineWeb2: One Pipeline to Scale Them All — Adapting Pre-Training Data Processing to Every Language [48.8]
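We introduce a new pre-training dataset curation pipeline based on FineWeb. We show that the pipeline can be used to create non-English corpora that produce more performant models than prior datasets. We scale the pipeline to over 1,000 languages using almost 100 Common Crawl snapshots to produce FineWeb2, a new 20-terabyte (5 billion document) multilingual dataset.
Reference translation of the paper (metadata) (Thu, 26 Jun 2025 01:01:47 GMT)
A proposal for a large-scale, multilingual, high-quality dataset. Deduplication and filtering reportedly enable more efficient training than with other datasets.
Repository: GitHub – huggingface/fineweb-2. Dataset: HuggingFaceFW/fineweb-2 · Datasets at Hugging Face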

Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability

Institutional Books 1.0: A 242B token dataset from Harvard Library’s collections, refined for accuracy and usability [1.3]
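Institutional Books 1.0 is a collection of public domain books digitized through Harvard Library's participation in the Google Books project, beginning in 2006. Working with Harvard Library, we extracted, analyzed, and processed these volumes into an extensively documented dataset of historic texts. The analysis covers the entirety of Harvard Library's collection scanned as part of that project, originally spanning 1,075,899 volumes written in over 250 different languages, for a total of approximately 250 billion tokens.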
Reference translation of the paper (metadata) (Tue, 10 Jun 2025 00:11:30 GMT)
A large-scale data release: "OCR-extracted text (original and post-processed) as well as the metadata (bibliographic, source, and generated) of the 983,004 volumes, or 242B tokens, identified as being in the public domain have been made available."
Dataset: institutional/institutional-books-1.0 · Datasets at Hugging Face. Repository: GitHub – instdin/institutional-books-1-pipeline: The Institutional Data Initiative’s pipeline for analyzing, refining, and publishing the Institutional Books 1.0 collection.

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text

The Common Pile v0.1: An 8TB Dataset of Public Domain and Openly Licensed Text [81.0]
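We collect, curate, and release the Common Pile v0.1, an eight-terabyte collection of openly licensed text. The Common Pile comprises content from 30 sources spanning diverse domains, including research papers, code, books, encyclopedias, educational materials, and audio transcripts. We validate our efforts by training two 7-billion-parameter LLMs on text from the Common Pile.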
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 16:21:30 GMT)
The paper builds a clean dataset and verifies that competitive models can be trained on it: "We release Common Pile v0.1, an 8TB corpus that—to our knowledge—constitutes the largest dataset built exclusively from openly licensed text." The authors state: "Our results demonstrate that not only is the Common Pile the strongest dataset for pretraining under an open-license constraint, but also that it produces models comparable to those trained on an equivalent amount of unlicensed data. This positive result holds promise for future of open-license pretraining, especially if the research community invests in collecting larger quantities of openly licensed text data in the future."
I think this is a very meaningful effort.
Dataset: Common Pile v0.1 Raw Data – a common-pile Collection. Repository: GitHub – r-three/common-pile: Code for collecting, processing, and preparing datasets for the Common Pile

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [104.0]
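Model merging aims to combine multiple expert models into a single model, reducing storage and serving costs. Previous work has mainly focused on merging visual classification models or Large Language Models (LLMs) for code and math tasks. This paper introduces a model merging benchmark for MLLMs that covers multiple tasks, including VQA, Geometry, Chart, OCR, and Grounding.
Reference translation of the paper (metadata) (Mon, 26 May 2025 12:23:14 GMT)
Introduces a benchmark for multimodal model merging.
Repository: GitHub – WalkerWorldPeace/MLLMerging: Official implementation of “Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging”.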

OpenThoughts: Data Recipes for Reasoning Models

OpenThoughts: Data Recipes for Reasoning Models [215.2]
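The goal of the OpenThoughts project is to create open-source datasets for training reasoning models. The OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data to match DeepSeek-R1-Distill-32B on standard reasoning benchmarks. Further work on the data pipeline led to the OpenThinker3-7B model.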
Reference translation of the paper (metadata) (Wed, 04 Jun 2025 17:25:39 GMT)
An open dataset for building LRMs. It is also a useful reference for directions in data augmentation.
Project site: Open Thoughts

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models

ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models [37.5]
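We conduct a case study using a synthetic dataset that can be solved only through visual reasoning. We then introduce ChartMuseum, a new chart question answering (QA) benchmark containing 1,162 expert-annotated questions. Humans achieve 93% accuracy, whereas the best-performing model, Gemini-2.5-Pro, reaches only 63.0%, and the leading open-source LVLM, Qwen2.5-VL-72B-Instruct, reaches only 38.5%.
Reference translation of the paper (metadata) (Mon, 19 May 2025 17:59:27 GMT)
A chart QA benchmark. It is a difficult task on which Gemini-2.5-Pro, o4, o3, Claude 3.7, and GPT-4.1 all score low.
Project site: ChartMuseum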

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials

TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials [70.1]
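We propose the TongUI framework, which builds generalized GUI agents by learning from rich multimodal web tutorials. We create the GUI-Net dataset, containing 143K trajectory data across five operating systems and more than 200 applications. We develop the TongUI agent by fine-tuning Qwen2.5-VL-3B/7B models on GUI-Net.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 06:15:56 GMT)
Dataset construction leveraging web tutorials, and agent development through fine-tuning.
Project site: TongUI: Building Generalized GUI Agents by Learning from Multimodal Web Tutorials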
