arXiv – Page 2 – Introducing the Latest arXiv Papers

Gemini 2.5, Deepseek V3, MCP …

The "weekly LLM" pace continues. Gemini 2.5 is Google DeepMind's latest model and performs very well (Gemini 2.5: Our newest Gemini model with thinking). It is impressive that scores keep rising even on a very hard dataset, with 18.8% on Humanity's Last Exam. Deepseek V3 has also been updated and now outperforms its initial version (DeepSeek-V3-0324 Release | DeepSeek API Docs, deepseek-ai/DeepSeek-V3-0324 · Hugging Face). The technical reports for Gemma 3 and Qwen2.5 Omni also deserve attention.

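Beyond LLMs, there is plenty of buzzworthy news, such as OpenAI's MCP support (Model context protocol (MCP) – OpenAI Agents SDK) and image generation AI (Introducing 4o Image Generation | OpenAI). New entrants such as Reve AI | Next-Gen AI Image Generator with Reve Image 1.0 are also appearing, so this is a truly active field.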

Gemma 3 Technical Report [198.3]
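Gemma 3 is a multimodal addition to the Gemma family of lightweight open models. This version introduces vision understanding abilities, wider language coverage, and longer context. The model architecture is also changed to reduce the KV-cache memory that tends to explode with long context.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 15:52:34 GMT)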
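To see why KV-cache memory matters at long context, here is a rough back-of-the-envelope sketch. The summary above only says that the architecture is changed; this sketch assumes the saving comes from interleaving sliding-window (local) attention layers with full (global) ones, and every number in it (layer count, head sizes, window, ratio) is a hypothetical value chosen for illustration, not a figure from the report.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim,
                   local_per_global=0, window=None, bytes_per_value=2):
    """Approximate KV-cache size in bytes for a single sequence.

    local_per_global: number of sliding-window layers per global layer
                      (0 means every layer uses global attention).
    window:           sliding-window length for local layers.
    """
    # Each token stores one key and one value vector per KV head per layer.
    per_token_per_layer = 2 * n_kv_heads * head_dim * bytes_per_value

    if local_per_global == 0 or window is None:
        return n_layers * seq_len * per_token_per_layer

    # Assumption: one global layer in every block of (local_per_global + 1).
    n_global = n_layers // (local_per_global + 1)
    n_local = n_layers - n_global
    # Local layers only need to cache the most recent `window` tokens.
    cached_tokens = n_global * seq_len + n_local * min(seq_len, window)
    return cached_tokens * per_token_per_layer


# Hypothetical configuration (not taken from the Gemma 3 report).
cfg = dict(seq_len=131072, n_layers=48, n_kv_heads=8, head_dim=128)

full = kv_cache_bytes(**cfg)
mixed = kv_cache_bytes(**cfg, local_per_global=5, window=1024)

print(f"all-global layers : {full / 2**30:.2f} GiB")
print(f"5 local : 1 global: {mixed / 2**30:.2f} GiB")
```

Under these assumed numbers the global layers dominate the cache, so raising the share of local layers or shortening their window lowers memory roughly in proportion.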

Qwen2.5-Omni Technical Report [31.0]
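This report presents an end-to-end multimodal model that perceives diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses. Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks such as Omni-Bench.
Reference translation of the paper (metadata) (Wed, 26 Mar 2025 04:17:55 GMT)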

Scaling Laws of Synthetic Data for Language Models

Scaling Laws of Synthetic Data for Language Models [132.7]
Introduces SynthLLM, a scalable framework that transforms pre-training corpora into diverse, high-quality synthetic datasets. The approach uses a graph algorithm to automatically extract and recombine high-level concepts across multiple documents.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 11:07:12 GMT)
A report on scaling laws for synthetic data. Building on SynthLLM, a framework for generating high-quality data, the paper reaches conclusions that suggest synthetic data is effective: "Key findings from our extensive mathematical experiments on SYNTHLLM include: (1) SYNTHLLM generates synthetic data that reliably adheres to the rectified scaling law across various model sizes; (2) Performance improvements plateau near 300B tokens; and (3) Larger models approach optimal performance with fewer training tokens."
The project site is Advancing AI for Humanity.

SynCity: Training-Free Generation of 3D Worlds

SynCity: Training-Free Generation of 3D Worlds [107.7]
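Proposes SynCity, a training-free and optimization-free approach to generating 3D worlds from textual descriptions. It shows how 3D and 2D generators can be combined to generate ever-expanding scenes.
Reference translation of the paper (metadata) (Thu, 20 Mar 2025 17:59:40 GMT)
The paper title sounds like something heard before. The method reads as a clever combination of existing pieces, and the sample outputs are interesting.
The repository is SynCity: Training-Free Generation of 3D Worlds.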

Analyzing the Usage of Donation Platforms for PyPI Libraries

Analyzing the Usage of Donation Platforms for PyPI Libraries [92.0]
This study analyzes the adoption of donation platforms in the PyPI ecosystem. GitHub Sponsors is the dominant platform, but many of the links listed on PyPI are outdated.
Reference translation of the paper (metadata) (Tue, 11 Mar 2025 10:27:31 GMT)
An analysis of donations to Python libraries. The finding "From a library perspective, we discovered that donation platform links are mostly missing on PyPI project pages, with a clear tendency to list them on GitHub repositories instead. GitHub Sponsors stands out as the primary donation platform across PyPI and GitHub." is about what one would expect.
The paper also notes that "Recent research highlights the strong connection between OSS maintenance activities and financial support.", and I hope a culture of donating spreads for the tools we conveniently rely on.

Measuring AI Ability to Complete Long Tasks

Measuring AI Ability to Complete Long Tasks [6.0]
Measures the time humans typically take to complete tasks that AI models can complete with a 50% success rate. Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. The increase in AI models' time horizons seems to be driven by greater reliability and the ability to adapt to mistakes.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:59:31 GMT)
Proposes and examines a metric called the "50%-task-completion time horizon", defined as "the time humans typically take to complete tasks that AI models can complete with 50% success rate". The paper reports that "On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes" and that "Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024."
It seems a useful indicator of how large a piece of software can be generated automatically. How to assess the extrapolation "Finally, we attempt to extrapolate the trend on our tasks to one-month (167 hours) AI (Section 7.1), finding that if both the trend continues and observed performance trends generalize to real-world tasks, an 80% confidence interval for the release date of AI that can complete 1-month long software tasks spans from late 2028 to early 2031" is a hard call, but it does feel plausible that software at the level a person takes a month to develop could become automatically generable.
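The quoted numbers can be sanity-checked with simple arithmetic. The sketch below is a naive point extrapolation that assumes a clean exponential trend with a fixed seven-month doubling time; it is not the paper's estimation procedure, which reports a confidence interval, and the function name is made up for this example.

```python
import math


def months_until_horizon(current_minutes, target_minutes, doubling_months):
    """Months for the time horizon to grow from current to target,
    assuming it doubles every `doubling_months` months (an assumption)."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings, doublings * doubling_months


# Figures quoted above: ~50 minutes today, a one-month task = 167 hours,
# doubling roughly every seven months.
doublings, months = months_until_horizon(
    current_minutes=50,
    target_minutes=167 * 60,
    doubling_months=7,
)

print(f"doublings needed: {doublings:.2f}")
print(f"months needed   : {months:.1f} (about {months / 12:.1f} years)")
```

Going from 50 minutes to 167 hours is a factor of about 200, which is a little under eight doublings, or roughly four and a half years at seven months per doubling. Counted from the paper's March 2025 date, that lands inside the quoted late 2028 to early 2031 interval.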

A Survey on Trustworthy LLM Agents: Threats and Countermeasures

A Survey on Trustworthy LLM Agents: Threats and Countermeasures [67.2]
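Large language models (LLMs) and multi-agent systems (MAS) have significantly expanded the capabilities of the LLM ecosystem. This paper proposes the TrustAgent framework, a comprehensive study of agent trustworthiness.
Reference translation of the paper (metadata) (Wed, 12 Mar 2025 08:42:05 GMT)
A survey of the trustworthiness of LLM-based agents from intrinsic (brain, memory, and tool) and extrinsic (user, agent, and environment) perspectives.
The repository is GitHub – Ymm-cll/TrustAgent.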

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding

MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding [40.5]
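MDocAgent is a novel RAG and multi-agent framework that leverages both text and images. The system uses five specialized agents: a general agent, a critical agent, a text agent, an image agent, and a summarizing agent. Preliminary experiments on five benchmarks demonstrate the effectiveness of MDocAgent, with an average improvement of 12.1%.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 06:57:21 GMT)
A very elaborately structured RAG (AgenticRAG).
The repository is GitHub – aiming-lab/MDocAgent: MDocAgent: A Multi-Modal Multi-Agent Framework for Document Understanding.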

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation

MMLU-ProX: A Multilingual Benchmark for Advanced Large Language Model Evaluation [60.5]
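MMLU-ProX is a comprehensive benchmark covering 13 typologically diverse languages, with approximately 11,829 questions per language. Using 5-shot chain-of-thought (CoT) and zero-shot prompting strategies, the authors evaluate 25 state-of-the-art large language models (LLMs) and analyze their performance across linguistic and cultural boundaries. The experiments show consistent performance degradation from high-resource to low-resource languages: the best models achieve over 70% accuracy in English but drop to around 40% for languages such as Swahili.
Reference translation of the paper (metadata) (Thu, 13 Mar 2025 15:59:20 GMT)
The benchmark is described as follows: "MMLU-ProX extends the challenging MMLU-Pro benchmark to encompass 13 typologically diverse languages: English (EN), Chinese (ZH), Japanese (JA), Korean (KO), French (FR), German (DE), Spanish (ES), Portuguese (PT), Arabic (AR), Thai (TH), Hindi (HI), Bengali (BN), and Swahili (SW)." and "By carefully translating the same set of questions across all languages, MMLU-ProX facilitates direct comparison of model performance across linguistic boundaries while controlling for question difficulty." Using a benchmark that can be evaluated across multiple languages makes cross-language differences easy to see.
The project site is MMLU-ProX: A Multilingual Benchmark for Advanced LLM Evaluation.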
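Because the same questions are translated into every language, per-language accuracy can be compared directly against a reference language. The following is a minimal sketch of that comparison. The record layout `(language, question_id, is_correct)` and the toy records are made up for illustration and are neither the benchmark's actual data format nor real results.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Compute accuracy per language from (lang, question_id, is_correct) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for lang, _question_id, is_correct in records:
        total[lang] += 1
        correct[lang] += int(is_correct)
    return {lang: correct[lang] / total[lang] for lang in total}


def gap_to_reference(accuracy, reference="EN"):
    """Accuracy drop of each language relative to the reference language."""
    ref = accuracy[reference]
    return {lang: ref - acc for lang, acc in accuracy.items() if lang != reference}


# Toy records (hypothetical): the same four question IDs in three languages.
toy_records = [
    ("EN", "q1", True), ("EN", "q2", True), ("EN", "q3", False), ("EN", "q4", True),
    ("JA", "q1", True), ("JA", "q2", True), ("JA", "q3", False), ("JA", "q4", False),
    ("SW", "q1", True), ("SW", "q2", False), ("SW", "q3", False), ("SW", "q4", False),
]

acc = accuracy_by_language(toy_records)
gaps = gap_to_reference(acc, reference="EN")

for lang, gap in sorted(gaps.items(), key=lambda item: item[1]):
    print(f"{lang}: accuracy {acc[lang]:.2f}, gap to EN {gap:.2f}")
```

Since question IDs are shared across languages, the same records could also be grouped by question to find items that are answered correctly in one language but not in another.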

EnvBench: A Benchmark for Automated Environment Setup

EnvBench: A Benchmark for Automated Environment Setup [76.0]
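Large language models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing work on environment setup introduces innovative agentic strategies, but its evaluation is often based on small datasets. To address this gap, the paper introduces EnvBench, a comprehensive environment setup benchmark.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:19:12 GMT)
A benchmark for environment setup. This matters a great deal in practice and, depending on the situation, may be more welcome than code generation.
It appears to be a hard benchmark on which scores stay low even with agents.
The repositories are GitHub – JetBrains-Research/EnvBench: [DL4C @ ICLR 2025] A Benchmark for Automated Environment Setup and 🌱⚙️ EnvBench – a JetBrains-Research Collection.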

Cosmos World Foundation Model Platform for Physical AI

Cosmos World Foundation Model Platform for Physical AI [136.1]
Presents the Cosmos World Foundation Model Platform to help developers build customized world models for their Physical AI setups. The platform covers a video curation pipeline, pre-trained world foundation models, examples of post-training pre-trained world foundation models, and video tokenizers.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 16:59:07 GMT)
Proposes Cosmos-Reason1, a multimodal model for understanding and reasoning about the physical world. As described in "In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes." and "With Physical AI SFT and RL, Cosmos-Reason1 can learn intuitive physics, such as the arrow of time and object permanence, which existing models struggle with.", its design resembles CoT-style LRMs. Reasoning models do seem effective for this field.
The repository is GitHub – nvidia-cosmos/cosmos-reason1: Cosmos-Reason1 models understand the physical common sense and generate appropriate embodied decisions in natural language through long chain-of-thought reasoning processes.

Cosmos-Transfer1: Conditional World Generation with Adaptive Multimodal Control [98.2]
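Introduces Cosmos-Transfer, a conditional world generation model that generates world simulations based on multiple spatial control inputs. The model is analyzed and evaluated to demonstrate its applications to Physical AI, including robotics Sim2Real and autonomous vehicle data enrichment.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:57:54 GMT)
Also noteworthy: a "diffusion-based conditional world model for multimodal controllable world generation".
The repository is GitHub – nvidia-cosmos/cosmos-transfer1: Cosmos-Transfer1 is a world-to-world transfer model designed to bridge the perceptual divide between simulated and real-world environments.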
