Introducing the Latest arXiv Papers

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.8]
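Large Action Models (LAMs) for AI agents offer great potential but face challenges because they require high-quality training data. LAM SIMULATOR is a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback. The framework features a dynamic task query generator, an extensive collection of tools, and an interactive environment in which Large Language Model (LLM) agents can call tools and receive real-time feedback.
Reference translation of the paper (metadata) (Mon, 02 Jun 2025 22:36:02 GMT)
“LAM SIMULATOR, a comprehensive framework designed for online exploration of agentic tasks with high-quality feedback”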

SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model [21.8]
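SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances drawn from leading scientific papers across 13 disciplines in the natural and computer sciences. The results show that even top-tier models such as GPT-4o-image lag behind human performance.
Reference translation of the paper (metadata) (Wed, 28 May 2025 08:51:01 GMT)
Construction and validation of a benchmark for scientific figure generation. Is the data not publicly released?
According to the paper, “We found that, with the exception of GPT-4o-image, other image generation models, such as Gemini-2.0-Flash, do not have any scientific mapping capabilities.”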

Pitfalls in Evaluating Language Model Forecasters

Pitfalls in Evaluating Language Model Forecasters [45.4]
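The authors argue that, as a community, we should be cautious about conclusions drawn from evaluating large language models as forecasters. They identify two categories of issues: (1) difficulty in trusting evaluation results because of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting.
Reference translation of the paper (metadata) (Sat, 31 May 2025 21:49:17 GMT)
A paper summarizing the pitfalls in evaluating LLM forecasters.
Their summary is: “We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims.” Evaluation really is hard.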
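One common guard against the temporal leakage mentioned above is to evaluate only on questions whose outcome became known after the model could have seen it. The following is a minimal sketch of that idea, not the paper's procedure; the `Question` record, the `leak_free` helper, and the `retrieval_cutoff` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Question:
    """Hypothetical record for one forecasting question."""
    text: str
    created: date   # when the question was posed
    resolved: date  # when the outcome became publicly known


def leak_free(questions: List[Question],
              knowledge_cutoff: date,
              retrieval_cutoff: Optional[date] = None) -> List[Question]:
    """Keep questions whose outcome the model could not have seen.

    knowledge_cutoff: last date covered by the model's training data (assumed known).
    retrieval_cutoff: latest publication date of documents a retrieval tool
        may return, if retrieval is used (assumption: the tool can enforce this).
    """
    kept = []
    for q in questions:
        # The outcome may already be in the training data.
        if q.resolved <= knowledge_cutoff:
            continue
        # Retrieved documents may postdate the resolution and reveal the answer.
        if retrieval_cutoff is not None and retrieval_cutoff >= q.resolved:
            continue
        kept.append(q)
    return kept
```

A date filter like this addresses only the most direct form of leakage; the paper's point is that there are many forms, so passing such a check does not by itself make an evaluation trustworthy.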

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [104.0]
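Model merging aims to combine multiple expert models into a single model, reducing storage and serving costs. Previous work has mainly focused on merging visual classification models, or Large Language Models (LLMs) for code and math tasks. This paper introduces a model merging benchmark for MLLMs that covers multiple tasks, including VQA, Geometry, Chart, OCR, and Grounding.
Reference translation of the paper (metadata) (Mon, 26 May 2025 12:23:14 GMT)
Introduces a benchmark for multimodal model merging.
The repository is GitHub – WalkerWorldPeace/MLLMerging: Official implementation of “Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging”.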
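For readers unfamiliar with model merging, the sketch below shows two generic baselines: plain weight averaging and task arithmetic (adding each expert's difference from a shared base model). It is an illustration only and not the method proposed in the paper; parameters are represented as a plain dict of floats, and the function names are hypothetical.

```python
from typing import Dict, List, Optional

Params = Dict[str, float]  # toy stand-in for a model's named parameters


def average_merge(models: List[Params],
                  weights: Optional[List[float]] = None) -> Params:
    """Merge models by a (weighted) average of each parameter."""
    if weights is None:
        weights = [1.0 / len(models)] * len(models)
    return {
        name: sum(w * m[name] for w, m in zip(weights, models))
        for name in models[0]
    }


def task_arithmetic(base: Params, experts: List[Params],
                    scale: float = 1.0) -> Params:
    """Add the scaled sum of task vectors (expert minus base) to the base."""
    merged = {}
    for name, base_value in base.items():
        delta = sum(expert[name] - base_value for expert in experts)
        merged[name] = base_value + scale * delta
    return merged
```

Both functions assume all models share the same architecture and parameter names, which is the usual precondition for merging experts fine-tuned from a common base.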

XToM: Exploring the Multilingual Theory of Mind for Large Language Models

XToM: Exploring the Multilingual Theory of Mind for Large Language Models [58.0]
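Existing evaluations of Theory of Mind (ToM) in LLMs are limited to English. XToM is a rigorously validated multilingual benchmark that evaluates ToM across five languages. The results reveal limits in the ability of LLMs to replicate human-like mentalizing across linguistic contexts.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 05:23:25 GMT)
A multilingual comparison of LLMs, concluding that “LLMs are equipped with multilingual understanding ability but fail in multilingual ToM reasoning tasks.” Cross-lingual differences at a deeper level appear to remain (although the gap also seems to have narrowed compared with a while ago).
The repository is GitHub – HKUST-KnowComp/XToM: Data and Code for paper “X-ToM: Exploring the Multilingual Theory of Mind for Large Language Models”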

A Survey of LLM × DATA

A Survey of LLM $\times$ DATA [72.0]
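The integration of large language models (LLM) and data management (DATA) is rapidly redefining both domains. On the one hand, Data4LLM supplies LLMs with the high-quality, diverse, and timely data required for stages such as pre-training, post-training, retrieval-augmented generation, and agentic generation. On the other hand, LLMs are emerging as general-purpose engines for data management.
Reference translation of the paper (metadata) (Sat, 24 May 2025 01:57:12 GMT)
A survey centered on data.
The repository is GitHub – weAIDB/awesome-data-llm: Official Repository of “LLM × DATA” Survey Paper, which links to a large number of papers.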

Self-Challenging Language Model Agents

Self-Challenging Language Model Agents [98.6]
This paper proposes the Self-Challenging framework, which trains an agent on high-quality tasks that the agent generates itself. The framework achieves a two-fold improvement for Llama-3.1-8B-Instruct.
Reference translation of the paper (metadata) (Mon, 02 Jun 2025 14:23:33 GMT)
According to the paper, “we present the Self-Challenging Agent (SCA) method for self-improvement of general multi-turn tool-use LLM agents. SCA can create its own tasks to challenge itself and learn from them. To do this, it utilizes the Code-as-Task (CaT) formulation which ensures high quality synthetic tasks. Through RL on these self-generated synthetic tasks, SCA can be used to train a Llama-3.1-8B model to achieve an average relative success rate improvement of 95.8% on existing test tasks across four different multi-turn tool-use environments.” A report that gives the sense of a future edging closer to AGI. (As the authors put it, “While SCA serves as a preliminary step, there remains many research questions for building an effective self-improvement flywheel for general LLM agents.”, so in practice there are probably still various barriers.)
The effective use of code generation is also interesting. Might the stage at which tasks expressible in a formal language become solvable arrive sooner than expected?
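The “two-fold improvement” in the abstract and the “relative success rate improvement of 95.8%” in the quoted passage describe the same scale of gain: a relative improvement of 95.8% multiplies the baseline by 1.958. The short sketch below makes the arithmetic explicit. The baseline success rate of 0.20 is a made-up value for illustration, and the reported figure is an average over four environments, so no single baseline reproduces it exactly.

```python
def relative_improvement(baseline: float, improved: float) -> float:
    """Relative gain of `improved` over `baseline` (0.958 means +95.8%)."""
    return (improved - baseline) / baseline


def apply_relative_improvement(baseline: float, rel: float) -> float:
    """Success rate after applying a relative gain to a baseline rate."""
    return baseline * (1.0 + rel)


# Hypothetical baseline success rate, chosen only to illustrate the arithmetic.
hypothetical_baseline = 0.20
after = apply_relative_improvement(hypothetical_baseline, 0.958)

# The ratio after / baseline is 1.958, i.e. close to a two-fold improvement.
ratio = after / hypothetical_baseline
```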

OpenThoughts: Data Recipes for Reasoning Models

OpenThoughts: Data Recipes for Reasoning Models [215.2]
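The goal of the OpenThoughts project is to create open-source datasets for training reasoning models. The OpenThoughts2-1M dataset led to OpenThinker2-32B, the first model trained on public reasoning data. The work also presents the OpenThinker3-7B model.
Reference translation of the paper (metadata) (Wed, 04 Jun 2025 17:25:39 GMT)
An open dataset for building LRMs. It is also a useful reference for directions in data augmentation.
The project site is Open Thoughts.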

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding [114.5]
The paper introduces PARTONOMY, an LMM benchmark designed for pixel-level part grounding. The authors train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens. The work opens up new avenues toward fine-grained, grounded visual understanding in LMMs.
Reference translation of the paper (metadata) (Tue, 27 May 2025 06:03:56 GMT)
Noting that “Unfortunately, Large Multimodal Models (LMMs), the backbones of today’s multimodal systems, lack strong part recognition abilities”, the paper proposes a benchmark to test this and an improved model, PLUM: Part-Level Understanding LMM.
The repository is GitHub – AnselBlume/partonomy: Repository for “Partonomy: Large Multimodal Models with Part-Level Visual Understanding”

Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes

Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes [26.7]
Ctrl-Crash is a controllable car crash video generation model conditioned on signals such as bounding boxes, crash types, and an initial image frame. The proposed method enables realistic scenario generation in which subtle changes in the input cause dramatic changes in the crash outcome.
Reference translation of the paper (metadata) (Fri, 30 May 2025 21:04:38 GMT)
“we introduce Ctrl-Crash, a controllable video diffusion framework for generating realistic crash videos from a single initial frame. Our method operates with inputs and outputs in pixel space, as opposed to using computer graphics primitives and explicit models of physics.”
This seems likely to be useful for thinking through a variety of situations.
The repository is Ctrl-Crash: Controllable Diffusion for Realistic Car Crashes
