Introducing the Latest arXiv Papers

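Featured

About This Site

This site introduces papers from Fugu-MT (arXiv paper translations). That said, in practice it is mostly a personal memo. Because both summarization and translation are automated, a problematic post occasionally slips through. For technical details, please see the Blog.

The content here reflects the personal views of the author (Satoshi Takahashi) and does not represent the opinions of his employer or affiliated institutions.

Lately the author has been posting on Twitter (@staka1982) while writing the blog.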

Pixels, Patterns, but No Poetry: To See The World like Humans

Pixels, Patterns, but No Poetry: To See The World like Humans [33.8]
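State-of-the-art MLLMs fail catastrophically on our perceptual tasks that are trivial for humans. This paper shifts the focus from reasoning to perception.
Paper reference translation (metadata) (Mon, 21 Jul 2025 21:50:16 GMT)
Proposes the Turing Eye Test (TET), a set of tasks that humans grasp intuitively. The authors report: "Through four diagnostic tasks involving concealed text, 3D Captchas, Chinese character compositions, and color blind test charts, we demonstrated that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that humans solve intuitively." Many of the tasks cannot be solved by current AI. It would be interesting to see whether models can understand the characters from the Sousaku Kanji (invented kanji) Contest, although leakage is a concern.
- The o3-pro I have access to did not seem able to read https://sousaku-kanji.com/archive/contest_15th.html.
The project site is Pixels, Patterns, but no Poetry: To See the World like Humans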

Diffusion Beats Autoregressive in Data-Constrained Settings

Diffusion Beats Autoregressive in Data-Constrained Settings [46.1]
Autoregressive (AR) models have long dominated the landscape of large language models. Recently, diffusion-based language models have emerged as a promising alternative, though their advantages over AR models remain underexplored.
Paper reference translation (metadata) (Mon, 21 Jul 2025 17:59:57 GMT)
The paper's finding: "In this paper, we systematically study masked diffusion models in data-constrained settings—where training involves repeated passes over limited data—and find that they significantly outperform AR models when compute is abundant but data is scarce. Diffusion models make better use of repeated data, achieving lower validation loss and superior downstream performance." Intuitively, that seems right to me.
The repository is Diffusion Beats Autoregressive in Data-Constrained Settings
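One way to build intuition for why repeated passes over the same data might help a masked diffusion model more than an AR model is to look at the training examples each objective produces. The sketch below is only an illustration under simplifying assumptions, not the paper's setup: `ar_examples`, `masked_diffusion_example`, and the `[MASK]` token are hypothetical names, and the uniform masking ratio is an assumed schedule.

```python
import random

MASK = "[MASK]"  # hypothetical mask token


def ar_examples(tokens):
    """Left-to-right next-token prediction: a fixed set of (prefix, target) pairs."""
    return [(tuple(tokens[:i]), tokens[i]) for i in range(1, len(tokens))]


def masked_diffusion_example(tokens, rng):
    """Sample a masking ratio, mask tokens independently, and predict the masked ones."""
    ratio = rng.random()  # assumed schedule: ratio ~ Uniform(0, 1)
    masked = [rng.random() < ratio for _ in tokens]
    if not any(masked):
        masked[rng.randrange(len(tokens))] = True  # keep at least one target
    corrupted = [MASK if m else t for t, m in zip(tokens, masked)]
    targets = {i: t for i, (t, m) in enumerate(zip(tokens, masked)) if m}
    return corrupted, targets


tokens = ["diffusion", "beats", "autoregressive", "when", "data", "is", "scarce"]
rng = random.Random(0)

# AR: every epoch yields exactly the same prediction tasks.
same_every_epoch = ar_examples(tokens) == ar_examples(tokens)

# Masked diffusion: each epoch draws a new corruption of the same sequence.
views = {tuple(masked_diffusion_example(tokens, rng)[0]) for _ in range(20)}
print(same_every_epoch, len(views), "distinct corrupted views in 20 epochs")
```

Each epoch over the same sentence gives the masked objective a different prediction problem, which reads like a form of data augmentation. Whether that fully explains the reported gap is a question for the paper itself.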

Docopilot: Improving Multimodal Models for Document-Level Understanding

Docopilot: Improving Multimodal Models for Document-Level Understanding [87.6]
To support detailed understanding of multimodal documents, we present Doc-750K, a high-quality document-level dataset. The dataset includes diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop Docopilot, a native multimodal model that can accurately handle document-level dependencies without relying on RAG.
Paper reference translation (metadata) (Sat, 19 Jul 2025 16:03:34 GMT)
Builds a large-scale dataset for multimodal document understanding and an InternVL2-based model. The reported gains: "The proposed Docopilot-8B shows a notable improvement over baseline models [73], achieving a +19.9% accuracy gain compared to InternVL2-8B and surpassing InternVL2-26B with less than 31% of the inference latency. Additionally, Docopilot-2B uses fewer parameters (less than 10%) while exhibiting comparable performance to the 10× larger InternVL2-26B."
The repository is OpenGVLab/Docopilot: [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.4]
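Visually-Rich Document Understanding (VRDU) has emerged as a critical field, driven by the need to automatically process documents containing complex visual, textual, and layout information. This survey reviews recent advances in MLLM-based VRDU, highlighting three core components.
Paper reference translation (metadata) (Mon, 14 Jul 2025 02:10:31 GMT)
A survey of document understanding, including the handling of figures and layout.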

Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2

There was a lot of news about Chinese models: Qwen3 Coder, which is at the level of Claude Sonnet 4 (QwenLM/Qwen3-Coder: Qwen3-Coder is the code version of Qwen3, the large language model series developed by Qwen team, Alibaba Cloud.); Intern S1, a powerful multimodal LRM that combines a 235B MoE language model (Qwen3) with a 6B vision encoder (InternViT) (InternLM/Intern-S1); and the release of the Kimi K2 technical report (Kimi-K2/tech_report.pdf at main · MoonshotAI/Kimi-K2). Competition is fierce, with Qwen3-Instruct-2507 (QwenLM/Qwen3: Qwen3 is the large language model series developed by Qwen team, Alibaba Cloud.) already claiming to surpass Kimi K2.

On the audio side, StepFun released the Step-Audio 2 Technical Report and TeleAI released the Technical Report of TeleChat2, TeleChat2.5 and T1. Both claim strong performance. Robotics papers such as GR-3 are also very intriguing.

And GPT-5 should be announced very soon, so progress looks set to continue.

Step-Audio 2 Technical Report [108.0]
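Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding.
Paper reference translation (metadata) (Thu, 24 Jul 2025 11:13:12 GMT)
The repository is stepfun-ai/Step-Audio2: Step-Audio 2 is an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation.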

Technical Report of TeleChat2, TeleChat2.5 and T1 [40.9]
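We introduce the latest TeleChat models: TeleChat2, TeleChat2.5, and T1. Despite minimal changes to the model architecture, the new series achieves substantial performance gains through enhanced training strategies.
Paper reference translation (metadata) (Thu, 24 Jul 2025 01:00:48 GMT)
The repository is Tele-AI/TeleChat2: TeleChat2 is a large language model developed and trained by the China Telecom Artificial Intelligence Research Institute, and the first open-source model at the 100-billion-parameter scale trained entirely on domestic compute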

GR-3 Technical Report [21.9]
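GR-3 is a large-scale vision-language-action (VLA) model. It shows exceptional capabilities in generalizing to novel objects, environments, and instructions involving abstract concepts. GR-3 excels at long-horizon and dexterous tasks, including those that require bi-manual manipulation and mobile movement.
Paper reference translation (metadata) (Mon, 21 Jul 2025 10:54:13 GMT)
The project site is ByteDance Seed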

Apple Intelligence Foundation Language Models: Tech Report 2025 [246.0]
We introduce two foundation language models that power Apple Intelligence features across Apple devices and services. Both models are trained on large-scale multilingual and multimodal datasets sourced via responsible web crawling. A new Swift-centric Foundation Models framework exposes guided generation, constrained tool calling, and LoRA adapter fine-tuning.
Paper reference translation (metadata) (Thu, 17 Jul 2025 23:37:19 GMT)
The Apple Intelligence technical report has been posted to arXiv.
Their own evaluation: "We found that AFM on-device model performs better than Qwen-2.5-3B, Gemma-3-4B and Gemma-3n-E4B on MMLU/MMMLU, but it lags slightly behind Gemma-3n-E4B on MGSM. AFM on-device model performs lower than the larger Qwen-3-4B model. AFM server models lag slightly to LLaMA 4 Scout, whose total size and active number of parameters are comparable, but has a bigger gap to larger models such as Qwen-3-235B and the proprietary GPT-4o."

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization

Expert-Guided LLM Reasoning for Battery Discovery: From AI-Driven Hypothesis to Synthesis and Characterization [48.0]
Large language models (LLMs) use chain-of-thought (CoT) techniques to tackle complex problems. We introduce ChatBattery, a novel agentic framework that integrates domain knowledge to steer LLMs toward more effective reasoning in materials design. We identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%.
Paper reference translation (metadata) (Mon, 21 Jul 2025 23:46:11 GMT)
AI that supports scientific discovery. As the description shows, "ChatBattery is an AI-driven material optimization platform structured into two synergistic phases: exploration and exploitation. Together, these phases encompass eight sequential stages, orchestrated by seven specialized agents.", it is a fairly complex multi-agent system. Collaboration with humans also appears to be emphasized.
- The paper notes: "This suggests that ChatBattery, in its present form, is more adept at optimizing within known paradigms than at generating fundamentally new chemistries. As such, expert input remains essential to expand the system’s exploration boundaries and push beyond conventional chemical spaces. Importantly, this interplay between AI-driven generation and human-guided refinement also creates unexpected opportunities, as demonstrated in the refinement of AI-suggested materials into even more advanced cathode compositions. However, advances anticipated with future reasoning AIs are likely to provide greater exploration and creativity."
The authors report that the approach paid off: "ChatBattery, we successfully identify, synthesize, and characterize three novel lithium-ion battery cathode materials, which achieve practical capacity improvements of 28.8%, 25.2%, and 18.5%, respectively, over the widely used cathode material, LiNi0.8Mn0.1Co0.1O2 (NMC811)."

Checklists Are Better Than Reward Models For Aligning Language Models

Checklists Are Better Than Reward Models For Aligning Language Models [99.2]
We propose Reinforcement Learning from Checklist Feedback (RLCF). From instructions, we extract checklists and evaluate how well responses satisfy each item. These scores, obtained using both AI judges and specialized verifier programs, are combined to compute rewards for RL.
Paper reference translation (metadata) (Thu, 24 Jul 2025 17:58:00 GMT)
The question posed is "how can we grade responses to instructions in a manner that is automatic (requires no human annotation), flexible (considers all aspects of response quality), intuitive (aligned with perceptible differences in responses), and applicable to any instruction or response, to enable more effective use of RL in language model alignment?" The proposed answer is to generate checklists and run reinforcement learning on checklist-based feedback. The effect is confirmed: "From instructions, we extract checklists and evaluate how well responses satisfy each item—using both AI judges and specialized verifier programs—then combine these scores to compute rewards for RL. We compare RLCF with other alignment methods applied to a strong instruction following model (Qwen2.5-7B-Instruct) on five widely-studied benchmarks – RLCF is the only method to improve performance on every benchmark, including a 4-point boost in hard satisfaction rate on FollowBench, a 6-point increase on InFoBench, and a 3-point rise in win rate on Arena-Hard."
Checklists are generated with a large model and then used for "Reinforcement Learning from Checklist Feedback" (RLCF). Much of the benefit probably comes from what is effectively distillation from the larger model, but it is still interesting that performance improves. (As the Limitations section notes, the computational cost is high.)
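To make the "combine these scores to compute rewards" step concrete, here is a minimal sketch of a checklist-based reward. It is not the paper's implementation. `ChecklistItem`, `checklist_reward`, `toy_judge`, the per-item weights, and the 50/50 averaging of judge and verifier scores are all assumptions made for illustration. In a real setup the judge would be an LLM call.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ChecklistItem:
    requirement: str
    weight: float = 1.0
    # Hypothetical stand-in for a "verifier program": returns True if the item is met.
    verifier: Optional[Callable[[str], bool]] = None


def checklist_reward(response, items, judge):
    """Weighted average of per-item scores, each in [0, 1]."""
    total, weight_sum = 0.0, 0.0
    for item in items:
        score = judge(response, item.requirement)  # judge score in [0, 1]
        if item.verifier is not None:
            # Assumption: average judge and verifier scores when both exist.
            score = 0.5 * score + 0.5 * float(item.verifier(response))
        total += item.weight * score
        weight_sum += item.weight
    return total / weight_sum if weight_sum else 0.0


def toy_judge(response, requirement):
    """Placeholder for an LLM judge; always returns a neutral score."""
    return 0.5


items = [
    ChecklistItem("Mentions Tokyo", verifier=lambda r: "tokyo" in r.lower()),
    ChecklistItem("Uses at most 10 words", verifier=lambda r: len(r.split()) <= 10),
    ChecklistItem("Has a polite tone", weight=2.0),  # judge only
]

reward = checklist_reward("Tokyo is lovely in spring.", items, toy_judge)
print(reward)
```

The scalar returned here is what an RL algorithm would consume as the reward for one sampled response. The cost noted in the paper's limitations is easy to see: every response needs one judge call per checklist item.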

AlphaGo Moment for Model Architecture Discovery

AlphaGo Moment for Model Architecture Discovery [26.3]
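We introduce ASI-Arch, the first demonstration of artificial superintelligence for AI research. ASI-Arch is a fully autonomous system that aims to break the constraints on research progress by enabling AI to carry out its own architectural innovation. We conducted 1,773 autonomous experiments over 20,000 GPU hours and discovered 106 innovative, state-of-the-art (SOTA) linear attention architectures.
Paper reference translation (metadata) (Thu, 24 Jul 2025 03:57:27 GMT)
An intriguing paper that puts ASI in the name of its system (ASI-Arch). It claims that "ASI-ARCH conducted 1,773 autonomous experiments over 20,000 GPU hours, culminating in the discovery of 106 innovative, state-of-the-art (SOTA) linear attention architectures."
- I am curious how this differs from Language Modeling by Language Models (covered earlier on this blog), and how the results hold up with more practical, larger-scale parameters, data, and compute.
- I wonder whether it will eventually be able to handle compound efficiency improvements like the recent result listed below.
The repository is GAIR-NLP/ASI-Arch: AlphaGo Moment for Model Architecture Discovery.; see also the Neural Network Research Data Gallery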

Scaling Linear Attention with Sparse State Expansion [58.2]
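Transformer architectures struggle with long-context scenarios because of quadratic computation and linear memory growth. This paper conceptualizes state updating as information classification and proposes a row-sparse update formulation for linear attention. It then presents Sparse State Expansion (SSE) within the sparse framework, expanding the contextual state into multiple partitions.
Paper reference translation (metadata) (Tue, 22 Jul 2025 13:27:31 GMT)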
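For readers less familiar with linear attention, the sketch below shows the standard unnormalized recurrence (state S_t = S_{t-1} + k_t v_t^T, output o_t = q_t^T S_t), whose state has a fixed size regardless of sequence length. The optional `top_k` argument is a loose illustration of what a row-sparse update could look like. Selecting rows by the largest key entries is my assumption and should not be read as the paper's formulation or as SSE itself.

```python
def linear_attention(queries, keys, values, top_k=None):
    """Causal linear attention without normalization.

    The state is a d_k x d_v matrix, so memory does not grow with sequence length.
    If top_k is given, only the top_k rows with the largest key entries are updated
    at each step (an assumed, illustrative row-sparse rule).
    """
    d_k, d_v = len(keys[0]), len(values[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v in zip(queries, keys, values):
        rows = range(d_k)
        if top_k is not None:
            rows = sorted(rows, key=lambda i: k[i], reverse=True)[:top_k]
        for i in rows:
            for j in range(d_v):
                state[i][j] += k[i] * v[j]  # rank-1 update restricted to chosen rows
        outputs.append([sum(q[i] * state[i][j] for i in range(d_k)) for j in range(d_v)])
    return outputs, state


queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.5], [0.2, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]

dense_out, dense_state = linear_attention(queries, keys, values)
sparse_out, sparse_state = linear_attention(queries, keys, values, top_k=1)
print(dense_out[-1], sparse_out[-1])
```

In the dense version every token writes to every row of the state, so information from different tokens mixes. Restricting each write to a few rows keeps them more separated, which is the general motivation behind sparse state updates.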

Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance

Enabling Self-Improving Agents to Learn at Test Time With Human-In-The-Loop Guidance [39.6]
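Large language model (LLM) agents often struggle in environments where rules and required domain knowledge change frequently. We propose the Adaptive Reflective Interactive Agent (ARIA), which continuously learns updated domain knowledge at test time. ARIA is deployed within TikTok Pay, which serves over 150 million monthly active users.
Paper reference translation (metadata) (Wed, 23 Jul 2025 02:12:32 GMT)
Proposes a framework that self-improves through memory: "ARIA addresses conventional model limitations in dynamic environments by assessing uncertainty via self-dialogue, soliciting expert corrections, and updating a timestamped, conflict-resolving knowledge base." It is impressive that it is actually deployed.
The repository is yf-he/aria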
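The phrase "timestamped, conflict-resolving knowledge base" can be illustrated with a small sketch. This is not ARIA's actual design. `Fact`, `KnowledgeBase`, `needs_expert`, the "newest timestamp wins" rule, and the 0.7 confidence threshold are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Fact:
    key: str
    value: str
    timestamp: int  # e.g. Unix seconds
    source: str


class KnowledgeBase:
    """Keeps one current fact per key; conflicts are resolved by timestamp (assumed rule)."""

    def __init__(self):
        self._current = {}
        self.history = []  # every update is retained for auditing

    def update(self, fact):
        """Record the fact; return True if it became the current value for its key."""
        self.history.append(fact)
        existing = self._current.get(fact.key)
        if existing is None or fact.timestamp >= existing.timestamp:
            self._current[fact.key] = fact
            return True
        return False

    def lookup(self, key):
        fact = self._current.get(key)
        return fact.value if fact else None


def needs_expert(confidence, threshold=0.7):
    """Hypothetical gate: ask a human expert when self-assessed confidence is low."""
    return confidence < threshold


kb = KnowledgeBase()
kb.update(Fact("refund_window_days", "14", timestamp=100, source="policy_v1"))
kb.update(Fact("refund_window_days", "30", timestamp=200, source="expert_correction"))
kb.update(Fact("refund_window_days", "7", timestamp=50, source="stale_doc"))  # ignored as older
print(kb.lookup("refund_window_days"), needs_expert(0.4))
```

The point of the timestamp is that a late-arriving but older rule does not overwrite a newer expert correction, while the history still records that the conflict occurred.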

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra

LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra [29.6]
We present a novel framework that uses agent-based modeling to design and assess economic policies. At the lower level, bounded rational worker agents choose labor supply to maximize text-based utility functions learned in-context. At the upper level, a planner agent employs in-context reinforcement learning to propose piecewise-linear marginal tax schedules anchored to the current U.S. federal brackets.
Paper reference translation (metadata) (Mon, 21 Jul 2025 17:21:14 GMT)
The authors state: "Our results show that a Llama-3 model can (i) recover the Mirrleesian trade-off between equity and efficiency, (ii) approach Saez-optimal schedules in heterogeneous settings where analytical formulas are unavailable, and (iii) reproduce political phenomena—such as majority exploitation and welfare-enhancing leader turnover—without any hand-crafted rules. Taken together, the experiments suggest that large language models can serve as tractable test beds for policy design long before real-world deployment, providing a bridge between modern generative AI and classical economic theory." This is an interesting result for LLM-based multi-agent simulation, and it is somewhat surprising that Llama-3.1-8B-Instruct is enough, given how elaborate the approach looks.
The repository is sethkarten/LLM-Economist: Official repository of the 2025 paper, LLM Economist: Large Population Models and Mechanism Design in Multi-Agent Generative Simulacra.
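A bracket-based marginal tax schedule, the kind of object the planner agent proposes, is easy to write down. The sketch below shows how tax owed follows from such a schedule. The function name `tax_owed` and the bracket boundaries and rates are made up for illustration. They are not the U.S. federal values and are not taken from the paper or its repository.

```python
def tax_owed(income, brackets):
    """Tax under a marginal schedule.

    brackets: list of (lower_bound, marginal_rate), sorted by lower_bound,
    with the first lower_bound equal to 0. Each rate applies only to the
    slice of income that falls inside its bracket.
    """
    tax = 0.0
    for i, (lower, rate) in enumerate(brackets):
        if income <= lower:
            break
        upper = brackets[i + 1][0] if i + 1 < len(brackets) else float("inf")
        tax += (min(income, upper) - lower) * rate
    return tax


# Illustrative schedule only (hypothetical numbers).
schedule = [(0, 0.10), (10_000, 0.20), (50_000, 0.30)]

for income in (5_000, 30_000, 60_000):
    owed = tax_owed(income, schedule)
    print(income, round(owed, 2), round(income - owed, 2))
```

With the bracket boundaries held fixed, the planner's action reduces to a short vector of rates. That is a small enough search space for in-context trial and error to be plausible, which may be part of why a relatively small model suffices.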
