MLLM – Introducing the Latest arXiv Papers

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents [130.7]
Compared with language model (LLM) agents, a key challenge in training vision-language model (VLM) agents is the shift from textual states to complex visual observations. Can VLM agents build internal world models through explicit visual state reasoning? We architecturally enforce and reward the agent's reasoning process through reinforcement learning (RL). We find that the agent's reasoning about state estimation and transition modeling is essential for success.
Reference translation of the paper (metadata) (Sun, 19 Oct 2025 16:05:07 GMT)
Motivated by the question “How can we effectively teach VLMs to build internal world models through explicit visual state reasoning?” and the observation that “Vision-language Model (VLM) agentic tasks are inherently complex due to the challenges in understanding visual states, which often are partial and noisy Observations, fundamentally reframing the problem from an Markov Decision Process (MDP) to a more challenging Partially Observable Markov Decision Process (POMDP).”, the paper proposes a framework that promotes building a world model. “To optimize an agent’s world model reasoning, we propose turn-level WorldModeling Reward for a dense turn-level reward to evaluate the accuracy of the agent’s internal state simulation against ground-truth; to solve the critical challenge of long-horizon credit assignment, we propose Bi-Level GAE to first computes the value of an entire turn’s reasoning before propagating that credit precisely to the individual tokens. Our VAGEN framework significantly enhances task performance and visual reasoning quality for VLM in agentic tasks.”
The project site is VAGEN – VLM Agent Training.
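
The quoted idea of crediting a whole turn first and then propagating that credit to individual tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `gae` and `bi_level_gae`, the zero-initialized value estimates in the usage example, and the choice to inject each turn's advantage as a reward on that turn's last token are all assumptions made for illustration.

```python
from typing import List, Tuple


def gae(rewards: List[float], values: List[float],
        gamma: float, lam: float) -> List[float]:
    """Standard generalized advantage estimation over one sequence.

    The value after the final step is taken to be 0 (episode ends).
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    next_value = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


def bi_level_gae(turn_rewards: List[float],
                 turn_values: List[float],
                 token_values: List[List[float]],
                 gamma_turn: float = 1.0, lam_turn: float = 1.0,
                 gamma_tok: float = 1.0, lam_tok: float = 1.0,
                 ) -> Tuple[List[float], List[List[float]]]:
    """Hypothetical two-level credit assignment.

    Level 1: one advantage per turn, computed across turns.
    Level 2: inside each turn, the turn advantage is placed on the
    last token (an assumption) and spread backwards over the tokens.
    """
    turn_adv = gae(turn_rewards, turn_values, gamma_turn, lam_turn)
    token_adv: List[List[float]] = []
    for adv, vals in zip(turn_adv, token_values):
        if not vals:  # a turn with no tokens gets no token credit
            token_adv.append([])
            continue
        rewards = [0.0] * len(vals)
        rewards[-1] = adv
        token_adv.append(gae(rewards, vals, gamma_tok, lam_tok))
    return turn_adv, token_adv


# Two turns; only the second turn receives a reward.
turn_adv, token_adv = bi_level_gae(
    turn_rewards=[0.0, 1.0],
    turn_values=[0.0, 0.0],
    token_values=[[0.0, 0.0], [0.0, 0.0, 0.0]],
    gamma_turn=0.5,
)
```

With a turn-level discount of 0.5, the first turn receives half the credit of the rewarded second turn, and every token inside a turn inherits that turn's credit because the token-level discount is left at 1.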

Explain Before You Answer: A Survey on Compositional Visual Reasoning

Explain Before You Answer: A Survey on Compositional Visual Reasoning [74.3]
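Compositional visual reasoning has emerged as a key research frontier in multimodal AI. This survey systematically reviews more than 260 papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). It then examines more than 60 benchmarks and their corresponding metrics along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 11:01:51 GMT)
A survey on compositional visual reasoning.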

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.7]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average improvement of +13.8% on five representative reasoning tasks. The results show that a unified model that excels at both evaluation and generation can be obtained.
Reference translation of the paper (metadata) (Sun, 31 Aug 2025 03:08:02 GMT)
The paper reports that “experimental results across massive visual benchmarks demonstrate that critic training not only substantially enhances the critic capabilities of VLMs, but also improves their performance as a general policy across a wide range of visual understanding and reasoning tasks. This dual improvement enables LLaVA-Critic-R1 to outperform other visual reasoning models trained with in-domain policy training, establishing it”. The two capabilities are surely closely related, but it is still an interesting behavior.
The repository is LLaVA-NeXT/llava-critic-r1 at main · LLaVA-VL/LLaVA-NeXT · GitHub
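
One plausible way to read "self-critique at test time" is that the same model samples several answers and then judges them pairwise, keeping the winner. The sketch below assumes that reading; it is not taken from the paper or the repository. The callables `generate` and `prefer` are hypothetical stand-ins for a model's sampling call and its critic call.

```python
from typing import Callable, List


def self_critic_select(question: str,
                       generate: Callable[[str], str],
                       prefer: Callable[[str, str, str], bool],
                       n: int = 4) -> str:
    """Sample n candidate answers and keep the one the critic prefers.

    generate(question) returns one candidate answer.
    prefer(question, a, b) returns True when the critic judges
    answer a to be better than answer b.
    Both callables are hypothetical placeholders for model calls.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(question) for _ in range(n)]
    best = candidates[0]
    for challenger in candidates[1:]:
        # Sequential knockout: the challenger replaces the current
        # best only if the critic prefers it.
        if prefer(question, challenger, best):
            best = challenger
    return best


# Toy usage with stub functions in place of a real model.
answers = iter(["a", "bbb", "cc"])
picked = self_critic_select(
    "dummy question",
    generate=lambda q: next(answers),
    prefer=lambda q, a, b: len(a) > len(b),  # stub critic: longer wins
    n=3,
)
```

The knockout loop needs only n - 1 critic calls, which keeps the extra test-time cost linear in the number of samples.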

Pixels, Patterns, but No Poetry: To See The World like Humans

Pixels, Patterns, but No Poetry: To See The World like Humans [33.8]
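State-of-the-art MLLMs fail catastrophically on our perceptual tasks, which are trivial for humans. This paper shifts the focus from reasoning to perception.
Reference translation of the paper (metadata) (Mon, 21 Jul 2025 21:50:16 GMT)
Proposes the Turing Eye Test (TET), a set of tasks that humans understand intuitively. As the paper puts it, “Through four diagnostic tasks involving concealed text, 3D Captchas, Chinese character compositions, and color blind test charts, we demonstrated that state-of-the-art MLLMs exhibit catastrophic failures on perceptual tasks that humans solve intuitively.”, so there are many tasks AI cannot solve. It would be interesting to see whether models can understand the characters from the Sousaku Kanji Contest (創作漢字コンテスト), although leakage is a concern.
- In my own quick test, o3-pro did not seem able to read https://sousaku-kanji.com/archive/contest_15th.html.
The project site is Pixels, Patterns, but no Poetry: To See the World like Humans.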

Docopilot: Improving Multimodal Models for Document-Level Understanding

Docopilot: Improving Multimodal Models for Document-Level Understanding [87.6]
To support detailed understanding of multimodal documents, we propose Doc-750K, a high-quality document-level dataset. The dataset contains diverse document structures, extensive cross-page dependencies, and real question-answer pairs derived from the original documents. Building on the dataset, we develop Docopilot, a native multimodal model that can accurately handle document-level dependencies without relying on RAG.
Reference translation of the paper (metadata) (Sat, 19 Jul 2025 16:03:34 GMT)
Construction of large-scale data for multimodal document understanding, together with an InternVL2-based model. The paper reports improved performance: “The proposed Docopilot-8B shows a notable improvement over baseline models [73], achieving a +19.9% accuracy gain compared to InternVL2-8B and surpassing InternVL2-26B with less than 31% of the inference latency. Additionally, Docopilot-2B uses fewer parameters (less than 10%) while exhibiting comparable performance to the 10× larger InternVL2-26B.”
The repository is OpenGVLab/Docopilot: [CVPR 2025] Docopilot: Improving Multimodal Models for Document-Level Understanding

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends

A Survey on MLLM-based Visually Rich Document Understanding: Methods, Challenges, and Emerging Trends [11.4]
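Visually-Rich Document Understanding (VRDU) has emerged as an important field because of the need to automatically process documents containing complex visual, textual, and layout information. This survey reviews recent advances in MLLM-based VRDU, highlighting three core components.
Reference translation of the paper (metadata) (Mon, 14 Jul 2025 02:10:31 GMT)
A survey of document understanding, including the handling of figures and layout.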

Robust Multimodal Large Language Models Against Modality Conflict

Robust Multimodal Large Language Models Against Modality Conflict [94.1]
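Multimodal large language models (MLLMs) are prone to hallucination in real-world scenarios. We study the inherent conflicts among inputs from different modalities, which place MLLMs in a dilemma and lead directly to hallucination. Three methods are proposed to mitigate hallucination caused by modality conflict.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 11:18:38 GMT)
An organized look at countermeasures for hallucination specific to MLLMs (hallucination related to inconsistencies between modalities). The authors also build a dataset called “Multimodal Modality Conflict (MMMC)” and evaluate on it. The evaluation tries prompt engineering, SFT, and reinforcement learning to reduce hallucination, and reports: “Our results show that the reinforcement learning method achieves the best performance in mitigating the hallucination under modality conflict, while the supervised fine-tuning method shows promising and stable performance.”
The repository is GitHub – zmzhang2000/MMMC: Official repository for Robust Multimodal Large Language Models Against Modality Conflict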

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.4]
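VLM2Vec-V2 is a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 00:51:57 GMT)
Proposes “MMEB-V2, an advanced multimodal embedding dataset designed to train and evaluate embedding models across three key visual modalities: images, videos, and visual documents.” and VLM2Vec-V2, an embedding model built on it. A quite general-purpose "2vec".
The project site is VLM2Vec.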
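
The practical appeal of a single embedding space for images, videos, and visual documents is that one similarity function can rank all of them against a query. The sketch below shows only that generic retrieval step with cosine similarity. It does not use the VLM2Vec-V2 API, and the tiny 2-dimensional vectors and item names are made-up placeholders for real model outputs.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query: Sequence[float],
                       items: Dict[str, Sequence[float]]) -> List[str]:
    """Return item names sorted from most to least similar to the query."""
    return sorted(items, key=lambda name: cosine(query, items[name]),
                  reverse=True)


# Placeholder embeddings; a real system would obtain these from a model.
candidates = {
    "image_item": [0.9, 0.1],
    "video_item": [0.1, 0.9],
    "document_item": [0.5, 0.5],
}
ranking = rank_by_similarity([1.0, 0.0], candidates)
```

Because every modality maps into the same space, adding a new item type changes only how the vectors are produced, not how they are ranked.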

Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact

Thinking Beyond Tokens: From Brain-Inspired Intelligence to Cognitive Foundations for Artificial General Intelligence and its Societal Impact [31.6]
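This paper presents an interdisciplinary synthesis of artificial intelligence, cognitive neuroscience, psychology, generative models, and agent-based systems. We analyze the architectural and cognitive foundations of general intelligence, highlighting the roles of modular reasoning, persistent memory, and multi-agent coordination. We identify key scientific, technical, and ethical challenges on the path to artificial general intelligence.
Reference translation of the paper (metadata) (Tue, 01 Jul 2025 16:52:25 GMT)
An organized overview of what pursuing AGI involves. The paper points out: “Several challenges remains, such as the need for grounded world models, dynamic memory, causal reasoning, robust handling of aleatory and epistemic uncertainty, developing perception of emotional and social contexts and collective agent architectures. Significant advancements have been made, such as Large Concept Models, Large Reasoning Models and Mixture of Experts, which improve LLM performance beyond next-token prediction by incorporating biologically inspired behaviors into output generation.”
I have some reservations about how it frames technical items such as MoE, but it is an interesting overview.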

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025

Pushing the Limits of Safety: A Technical Report on the ATLAS Challenge 2025 [167.9]
This paper reports the results of the Adversarial Testing & Large-model Alignment Safety Grand Challenge (ATLAS) 2025. The competition involved 86 teams that tested MLLM vulnerabilities through adversarial image-text attacks in two phases: white-box and black-box evaluation. The challenge establishes a new benchmark for MLLM safety evaluation and lays the groundwork for safer AI systems.
Reference translation of the paper (metadata) (Sat, 14 Jun 2025 10:03:17 GMT)
A report on the results of a competition on attacking MLLMs. The techniques used in a competition with this many participating teams are very instructive. The first-place team writes: “In this competition, we proposed an effective multimodal jailbreak strategy by embedding malicious intent within visually structured diagrams, particularly flowcharts, and enhancing it with carefully designed textual prompts. Our approach leveraged the weaknesses in safety alignment of vision-language models, exploiting their tendency to follow structured visual and textual cues.” It is interesting how well images are put to use, for example in jailbreaks delivered through flowcharts.
The repository is GitHub – NY1024/ATLAS_Challenge_2025
