November 11, 2025 – Introducing the latest arXiv papers

Thinking with Video, V-Thinker

Research on using multimodal data during reasoning is making steady progress.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.5]
The "Thinking with Video" paradigm bridges visual and textual reasoning within a unified temporal framework. Sora-2 is established as a capable reasoner on vision-centric tasks. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% on MMMU.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 17:25:23 GMT)
The paper states: "Moving beyond the traditional paradigms of 'Thinking with Text' (e.g., Chain-of-Thought [3, 37]) and 'Thinking with Images', we propose 'Thinking with Video'. It naturally enables human-like dynamic reasoning through video generation, such as drawing and imagination." In other words, reasoning carried out through video.
The project site is Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm, and the repository is GitHub – tongjingqi/Thinking-with-Video: We introduce “Thinking with Video”, a new paradigm leveraging video generation for unified multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT5 by 10% on eyeballing puzzles and reaches 75% accuracy on MMMU, positioning video generation as a promising multimodal reasoning paradigm.

V-Thinker: Interactive Thinking with Images [22.6]
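Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for large multimodal models (LMMs). We propose V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 15:32:29 GMT)
The paper states: "we introduce V-Thinker, a general-purpose multimodal reasoning assistant that fosters interactive vision-centric thinking via end-to-end reinforcement training." This is a proposal for an assistant that uses vision while it reasons.
The repository is GitHub – We-Math/V-Thinker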

ToM: Leveraging Tree-oriented MapReduce for Long-Context Reasoning in Large Language Models [107.9]
ToM is a new tree-oriented MapReduce framework for long-context reasoning. We show that ToM significantly outperforms existing divide-and-conquer frameworks and retrieval-augmented generation methods.
Reference translation of the paper (metadata) (Sat, 01 Nov 2025 10:43:58 GMT)
The paper states: "Leveraging a tree-structured MapReduce approach, ToM performs recursive reasoning over documents to enhance long-context understanding. It consists of two key components: 1) DocTree Construction: ToM first applies Hierarchical Semantic Parsing to convert each chunk into a structured subtree, then combines these subtrees into a hierarchical DocTree through Bottom-up Aggregation. 2) Recursive Reasoning via MapReduce: ToM performs recursive reasoning on the DocTree in a MapReduce fashion, enabling systematic aggregation of rationales across the hierarchy." This is a proposal for long-document processing that combines tree structuring with MapReduce. It reportedly performs better than typical RAG.
The repository is GitHub – gjn12-31/ToM
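To make the two components quoted above easier to picture, here is a minimal Python sketch of the general idea: build a tree over document chunks bottom-up, then answer a query by mapping over the leaves and reducing upward. This is not the authors' implementation. The names `Node`, `build_tree`, `map_reduce`, `keyword_map`, and `concat_reduce` are hypothetical, the tree is a plain fixed-fanout grouping rather than the paper's Hierarchical Semantic Parsing, and simple keyword matching stands in for the LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One node of a toy document tree; only leaves carry chunk text."""
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def build_tree(chunks: List[str], fanout: int = 2) -> Node:
    """Bottom-up aggregation: group nodes `fanout` at a time until one root remains.

    Assumption: fixed-size grouping replaces the paper's semantic parsing.
    """
    if not chunks:
        raise ValueError("need at least one chunk")
    if fanout < 2:
        raise ValueError("fanout must be at least 2")
    level = [Node(text=chunk) for chunk in chunks]
    while len(level) > 1:
        level = [Node(children=level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]


def map_reduce(node: Node,
               query: str,
               map_fn: Callable[[str, str], List[str]],
               reduce_fn: Callable[[List[List[str]], str], List[str]]) -> List[str]:
    """Map on leaves, then recursively reduce the children's partial results."""
    if not node.children:
        return map_fn(node.text, query)
    partials = [map_reduce(child, query, map_fn, reduce_fn)
                for child in node.children]
    return reduce_fn(partials, query)


def keyword_map(text: str, query: str) -> List[str]:
    """Stand-in for an LLM map step: keep the chunk if it mentions the query."""
    return [text] if query.lower() in text.lower() else []


def concat_reduce(partials: List[List[str]], query: str) -> List[str]:
    """Stand-in for an LLM reduce step: concatenate the children's findings in order."""
    return [item for partial in partials for item in partial]


chunks = [
    "Alice founded the lab in 2010.",
    "The lab moved to Kyoto.",
    "Bob joined Alice in 2015.",
]
answer = map_reduce(build_tree(chunks), "Alice", keyword_map, concat_reduce)
print(answer)
```

In a real system the map and reduce functions would presumably be model calls that produce and merge rationales; the recursion itself stays the same shape.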

Global PIQA: Evaluating Physical Commonsense Reasoning Across 100+ Languages and Cultures [118.0]
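We propose Global PIQA, a participatory commonsense reasoning benchmark covering more than 100 languages. Its 116 language varieties span 5 continents, 14 language families, and 23 writing systems. In the non-parallel split of Global PIQA, over 50% of the examples refer to local foods, customs, traditions, or other culturally specific elements.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 05:46:25 GMT)
The paper states: "we have presented Global PIQA, a physical commonsense reasoning benchmark covering 116 language varieties. Unlike previous benchmarks, Global PIQA is a participatory benchmark, constructed by hand by 335 researchers across 65 countries." A multilingual benchmark.
Japanese data is included as well. (Some entries look a little unsettling, and I am somewhat tempted to check through the whole set.)
The data is at mrlbenchmarks/global-piqa-nonparallel · Datasets at Hugging Face, and the project site is MRL Benchmarks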