Visual Chain-of-Thought – Introducing the Latest arXiv Papers

When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought [118.7]
MIRA is a new benchmark designed to evaluate models in scenarios where generating intermediate images is essential to successful reasoning. It contains 546 multimodal problems, each annotated with intermediate images and a final answer.
Reference translation of the paper (metadata) (Tue, 04 Nov 2025 18:00:51 GMT)
"To bridge this gap, we introduce MIRA (Multimodal Imagination for Reasoning Assessment), a benchmark designed to evaluate reasoning scenarios where generating or leveraging intermediate visual representations is essential. Each instance is constructed according to three principles: (1) requiring intermediate visual cues to answer the question, (2) pairing each instance with annotated step-wise visual clues to enable evaluation under a Visual-CoT setup, and (3) enforcing strict human annotation and cross-validation to guarantee data quality." The paper proposes a benchmark for reasoning that requires visual, image-based intermediate representations. The tasks are difficult even for frontier models (although open models also appear to hold their own).
The project site is When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought

Research on using multimodal data during reasoning is advancing.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.5]
The "Thinking with Video" paradigm bridges visual and textual reasoning within a unified temporal framework. Sora-2 is established as a capable reasoner on vision-centric tasks. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% on MMMU.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 17:25:23 GMT)
"Moving beyond the traditional paradigms of “Thinking with Text” (e.g., Chain-of-Thought [3, 37]) and “Thinking with Images”, we propose “Thinking with Video”. It naturally enables human-like dynamic reasoning through video generation, such as drawing and imagination." The paper proposes reasoning by means of video.
The project site is Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm, and the repository is GitHub – tongjingqi/Thinking-with-Video: We introduce “Thinking with Video”, a new paradigm leveraging video generation for unified multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT5 by 10% on eyeballing puzzles and reaches 75% accuracy on MMMU, positioning video generation as a promising multimodal reasoning paradigm.

V-Thinker: Interactive Thinking with Images [22.6]
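Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for large multimodal models (LMMs). The paper proposes V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive vision-centric thinking through end-to-end reinforcement learning. V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 15:32:29 GMT)
"we introduce V-Thinker, a general-purpose multimodal reasoning assistant that fosters interactive vision-centric thinking via end-to-end reinforcement training." The paper proposes an assistant that reasons by making active use of vision.
The repository is GitHub – We-Math/V-Thinker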

Tags: Visual Chain-of-Thought