arXiv – Page 24 – Introducing the Latest arXiv Papers

10 Open Challenges Steering the Future of Vision-Language-Action Models

10 Open Challenges Steering the Future of Vision-Language-Action Models [57.8]
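Vision-language-action (VLA) models are becoming increasingly prevalent in the embodied AI arena. The paper discusses ten milestones in the development of VLA models.
Reference translation of the paper (metadata) (Sat, 08 Nov 2025 09:02:13 GMT)
An overview of the open challenges for Vision-Language-Action models.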

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark [48.0]
Video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning. Existing benchmarks focus on fidelity and alignment and do not evaluate CoF reasoning. We introduce Gen-ViRe, a framework grounded in cognitive science and real-world AI applications.
Reference translation of the paper (metadata) (Mon, 17 Nov 2025 19:11:39 GMT)
A proposed benchmark that evaluates whether video generation models grasp causal relationships (and hence their potential as world models). "Gen-ViRe evaluates six core cognitive dimensions: (1) Perceptual, (2) Analogical, (3) Abstract, (4) Planning, (5) Spatial & Temporal, and (6) Algorithmic & Logical, with each dimension comprising four different sub-categories."
"Sora-2 achieves the highest overall score (0.560), establishing the top tier with particularly strong performance in the most cognitively demanding domains: “Abstract Reasoning” (0.604), “Algorithmic & Logical” (0.472), and “Perceptual” (0.496). The second tier comprises three highly competitive models—Hailuo-2.3 (0.493), Wan-2.5 (0.490), and Veo-3.1 (0.486)—each exhibiting distinct specialized strengths. Hailuo-2.3 achieves the highest score in “Planning” (0.778), showcasing exceptional sequential decision-making capabilities, while Wan-2.5 leads in “Analogy” (0.500), excelling at analogical reasoning." It is interesting that the strengths differ so much from model to model.
The repository is https://github.com/L-CodingSpace/GVR
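
To make the tiering in the quoted results easier to see, here is a minimal sketch that ranks the four models by the overall Gen-ViRe scores quoted above and prints each model's gap to the leader. Only the numbers are taken from the excerpt; the dictionary layout and the helper names `rank_models` and `gap_to_leader` are assumptions made for illustration and are not part of the benchmark's code.

```python
# Overall Gen-ViRe scores exactly as quoted in the excerpt above.
overall_scores = {
    "Sora-2": 0.560,
    "Hailuo-2.3": 0.493,
    "Wan-2.5": 0.490,
    "Veo-3.1": 0.486,
}


def rank_models(scores):
    """Return (model, score) pairs sorted from highest to lowest score."""
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


def gap_to_leader(scores):
    """Return each model's score deficit relative to the top-scoring model."""
    best = max(scores.values())
    return {model: round(best - score, 3) for model, score in scores.items()}


ranking = rank_models(overall_scores)
gaps = gap_to_leader(overall_scores)
for model, score in ranking:
    print(f"{model}: {score:.3f} (gap to leader: {gaps[model]:.3f})")
```

On these figures the three second-tier models sit within 0.007 of each other, while the gap between the first and second tier is roughly ten times larger.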

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling [115.7]
MiroThinker is an open-source research agent designed to advance tool-augmented reasoning and information-seeking capabilities. Unlike previous agents that scale up only model size or context length, MiroThinker explores interaction scaling at the model level.
Reference translation of the paper (metadata) (Tue, 18 Nov 2025 15:45:29 GMT)
"MiroThinker v1.0, an open-source research agent that advances tool-augmented reasoning through model, context, and interactive scaling." In other words, this is an open tool-augmented agent rather than a RAG system. Its GAIA score is very high.
The demo is MiroThinker, and the repository is GitHub – MiroMindAI/MiroThinker: MiroThinker is open-source agentic models trained for deep research and complex tool use scenarios.

Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges

Twenty-Five Years of MIR Research: Achievements, Practices, Evaluations, and Future Challenges [68.5]
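We trace the evolution of music information retrieval (MIR) over the past 25 years. MIR gathers all kinds of research related to music informatics. We review a set of success stories that have fueled the rapid development of MIR research.
Reference translation of the paper (metadata) (Mon, 10 Nov 2025 15:32:23 GMT)
A (short) survey of Music Information Retrieval.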

Think Visually, Reason Textually: Vision-Language Synergy in ARC / ARC Is a Vision Problem!

Think Visually, Reason Textually: Vision-Language Synergy in ARC [94.2]
ARC-AGI is a rigorous testbed for inducing conceptual rules and transferring them to new tasks. Naively rendering ARC-AGI grids as images degrades performance because rule execution becomes imprecise. We introduce two synergistic strategies: Vision-Language Synergy Reasoning (VLSR), which decomposes ARC-AGI into modality-aligned subtasks, and Modality-Switch Self-Correction (MSSC), which uses vision to verify text-based reasoning for intrinsic error correction.
Reference translation of the paper (metadata) (Wed, 19 Nov 2025 18:59:04 GMT)
The paper observes: "Our analysis of the OpenAI o4-mini model reveals striking differences: vision excels at rule summarization, providing a 3.0% improvement through its holistic perception of 2D spatial structures, while text excels at rule application, with vision causing a dramatic 20.5% performance drop due to imprecise element-wise manipulation. These findings demonstrate that the question is not whether to use vision or text, but rather when and how to strategically combine them." It also reports: "By fine-tuning separate models for visual rule summarization and textual rule application, our approach achieves a 3.5% improvement over text-only fine-tuning on the same training data, enabling small open-source models (Qwen3-8B) to surpass closed-source models like GPT-4o."
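
The division of labor described in the quote (summarize the rule from a visual view, apply it on a text view) can be sketched as a two-stage pipeline. This is only an illustration of the idea, not the authors' VLSR implementation: the function names, the text serialization format, and the two stub callables standing in for the vision-side and text-side models are all hypothetical.

```python
def grid_to_text(grid):
    """Serialize an ARC-style grid (list of rows of ints) as plain text."""
    return "\n".join(" ".join(str(cell) for cell in row) for row in grid)


def text_to_grid(text):
    """Parse the plain-text serialization back into a list of rows of ints."""
    return [[int(tok) for tok in line.split()] for line in text.splitlines()]


def solve(train_pairs, test_input, summarize_rule, apply_rule):
    """Two-stage solve: summarize the rule first, then apply it on text.

    summarize_rule: stand-in for a model that looks at the examples
                    (rendered as images in the paper's setting) and returns
                    a natural-language rule.
    apply_rule:     stand-in for a model that executes the rule on the
                    text-serialized test grid.
    """
    rule = summarize_rule(train_pairs)                # stage 1: rule summarization
    return apply_rule(rule, grid_to_text(test_input))  # stage 2: rule application


# Hypothetical stubs so the sketch runs without any model.
def stub_summarize(train_pairs):
    return "mirror each row left-to-right"


def stub_apply(rule, grid_text):
    assert rule == "mirror each row left-to-right"
    mirrored = [list(reversed(row)) for row in text_to_grid(grid_text)]
    return grid_to_text(mirrored)


train = [([[1, 0], [0, 2]], [[0, 1], [2, 0]])]
prediction = solve(train, [[1, 2, 3], [4, 5, 6]], stub_summarize, stub_apply)
print(prediction)
```

Keeping the two stages behind separate callables mirrors the paper's point that the question is when to use each modality, so either stage can be swapped without touching the other.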

ARC Is a Vision Problem! [50.6]
We formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. Our framework, Vision ARC, achieves 60.4% accuracy on the ARC-1 benchmark.
Reference translation of the paper (metadata) (Tue, 18 Nov 2025 18:59:49 GMT)
As the title says: "although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem." The paper solves ARC as a vision problem and achieves a high score.
The project site is GitHub – lillian039/VARC
"It is natural to explore vision driven approaches for ARC. On the other hand, human reasoning is not confined to language or vision in isolation, but instead should integrate information across modalities. With our complementary vision-based perspective, we hope the scope of abstract reasoning will be further broadened." I think this point is exactly right. It echoes the observation in Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark – Introducing the Latest arXiv Papers. Given things like the impressive performance of NanoBanana, I have the sense that we will move closer to AGI as these capabilities are integrated well.

Computer-Use Agents as Judges for Generative User Interface

Computer-Use Agents as Judges for Generative User Interface [142.8]
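Computer-Use Agents (CUA) are becoming increasingly capable of autonomously operating digital environments through graphical user interfaces (GUIs). Most GUIs are designed for humans, adopting human-oriented behaviors so that people can carry out tasks efficiently. Can CUAs, acting as judges, assist a Coder in automatic GUI design?
Reference translation of the paper (metadata) (Wed, 19 Nov 2025 16:00:02 GMT)
"By positioning agents as both designers and judges, our framework shifts interface design toward agent-native efficiency and reliability. Our work takes a step toward shifting agents from passive use toward active participation in digital environments." A proposed framework for thinking about UI in the age of agents.
An interesting idea, since the target could just as well be an individual person rather than an agent.
The project site is Computer-Use Agents as Judges for Generative User Interface, and the repository is GitHub – showlab/AUI: Computer-Use Agents as Judges for Generative UI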

SSR: Socratic Self-Refine for Large Language Model Reasoning

SSR: Socratic Self-Refine for Large Language Model Reasoning [78.6]
Socratic Self-Refine (SSR) is a novel framework for fine-grained evaluation and precise refinement of large language model (LLM) reasoning. SSR decomposes model responses into verifiable (sub-question, sub-answer) pairs, enabling step-level confidence estimation. Empirical results on five reasoning benchmarks and three LLMs show that SSR consistently outperforms state-of-the-art iterative self-refinement baselines.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 02:00:16 GMT)
"We propose a novel framework, Socratic Self-Refine (SSR), that allows more fine-grained confidence estimation and precise error control over decomposed reasoning steps. By formulating reasoning as a sequence of (sub-question, sub-answer) pairs, SSR overcomes the limitations of existing holistic self-refinement methods." The paper proposes this framework and confirms its effectiveness.
The repository is GitHub – SalesforceAIResearch/socratic-self-refine-reasoning
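
As a rough illustration of what step-level confidence over (sub-question, sub-answer) pairs could look like, the sketch below scores each step by how often re-solving its sub-question reproduces the recorded sub-answer, then repairs only the weakest step. The source does not say how SSR estimates confidence, so the agreement-based estimate, the `Step` class, the `threshold` parameter, and the majority-vote repair are all assumptions for illustration rather than the paper's algorithm.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    sub_question: str
    sub_answer: str
    # Hypothetical: answers obtained by re-solving the sub-question several times.
    resampled_answers: list = field(default_factory=list)


def step_confidence(step):
    """Fraction of resampled answers that agree with the recorded sub-answer."""
    if not step.resampled_answers:
        return 0.0
    counts = Counter(step.resampled_answers)
    return counts[step.sub_answer] / len(step.resampled_answers)


def weakest_step(steps):
    """Return (index, confidence) of the least confident step."""
    confidences = [step_confidence(s) for s in steps]
    idx = min(range(len(steps)), key=lambda i: confidences[i])
    return idx, confidences[idx]


def refine_weakest(steps, threshold=0.5):
    """If the weakest step is below threshold, replace its answer with the
    majority resampled answer. Returns (index, confidence before repair)."""
    idx, conf = weakest_step(steps)
    if conf < threshold and steps[idx].resampled_answers:
        majority, _ = Counter(steps[idx].resampled_answers).most_common(1)[0]
        steps[idx].sub_answer = majority
    return idx, conf


steps = [
    Step("What is 12 * 3?", "36", ["36", "36", "36"]),
    Step("What is 36 + 7?", "42", ["43", "43", "42"]),
]
idx, conf = refine_weakest(steps)
print(idx, round(conf, 2), steps[idx].sub_answer)
```

The contrast with holistic self-refinement is that only the second step is touched here; the first step, which every resample agrees on, is left alone.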

Grok 4.1, Gemini 3 Pro, GPT-5.1 Pro / Codex, Nano Banana Pro (Gemini Image Pro), Olmo 3, Step-Audio-R1, Omnilingual ASR

Last week made the fierce competition at the frontier-model level very clear. Major announcements came one after another: Grok 4.1 (Grok 4.1 | xAI), Gemini 3 Pro (Gemini 3 Pro – Google DeepMind), GPT-5.1 Pro (OpenAI on X: "GPT-5.1 Pro is rolling out today to all Pro users. It delivers clearer, more capable answers for complex work, with strong gains in writing help, data science, and business tasks." / X), and GPT-5.1-Codex-Max (Building more with GPT-5.1-Codex-Max | OpenAI). Beyond the official benchmark results, many people have been running their own evaluations, and I have been testing the models myself. The results give hope that LLM/LRM performance still has room to improve.

Google's Nano Banana Pro (Google AI on X: "Rolling out today we are launching Nano Banana Pro, the world’s best image model built to move beyond casual creation and into a new era of studio-quality, functional design. Nano Banana Pro enables a new level of precision and creative control, transforming the way you bring https://t.co/BsyAgkUY7X" / X) gives the impression of being a clear step above in image generation. Together with Gemini's strong multimodal performance, Google's overall strength is impressive these days.

Among open models, Olmo 3 (Nathan Lambert on X: "We present Olmo 3, our next family of fully open, leading language models. This family of 7B and 32B models represents: 1. The best 32B base model. 2. The best 7B Western thinking & instruct models. 3. The first 32B (or larger) fully open reasoning model. This is a big https://t.co/dpMtRHSjRp" / X) has also been released. It delivers top-level performance for a 32B model (Olmo Improvement Benchmark), and with Step-Audio-R1 and Omnilingual ASR in the audio domain, this line of work shows no sign of losing momentum either.

Step-Audio-R1 Technical Report [70.4]
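This paper introduces Step-Audio-R1, the first audio reasoning model to successfully unlock reasoning capabilities in the audio domain. The model demonstrates strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro.
Reference translation of the paper (metadata) (Wed, 19 Nov 2025 20:12:50 GMT)
The paper claims to be competitive even with Gemini 3 Pro: "Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain"
The repository is GitHub – stepfun-ai/Step-Audio-R1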

Omnilingual ASR: Open-Source Multilingual Speech Recognition for 1600+ Languages [76.1]
We introduce Omnilingual ASR, a large-scale automatic speech recognition system. It scales self-supervised pre-training to 7B parameters to learn robust speech representations. It extends coverage to more than 1,600 languages, including over 500 that no ASR system had previously served.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 01:04:28 GMT)
"Omnilingual ASR illustrates how scaling methods, when combined with deliberate data collection and new architectural innovation, can reshape the trajectory of multilingual ASR. The project not only extends coverage to more than 1,600 languages, with over 500 represented for the first time in any ASR system, but also reframes how coverage itself is conceived." A model that covers a very large number of languages.
The repository is GitHub – facebookresearch/omnilingual-asr: Omnilingual ASR Open-Source Multilingual SpeechRecognition for 1600+ Languages

SAM 3D: 3Dfy Anything in Images

SAM 3D: 3Dfy Anything in Images [99.1]
We present SAM 3D, a generative model for visually grounded 3D object reconstruction that predicts geometry, texture, and layout from an image. We achieve this with a human- and model-in-the-loop pipeline for annotating object shape, texture, and pose. We release our code and model weights, an online demo, and a new challenging benchmark for in-the-wild 3D object reconstruction.
Reference translation of the paper (metadata) (Thu, 20 Nov 2025 18:31:46 GMT)
"SAM 3D, a generative model for visually grounded 3D object reconstruction, predicting geometry, texture, and layout from a single image." This is a 3D reconstruction model, and the quality looks very high. It is reportedly built with an LLM-like staged training approach:
- "As in recent works, we first train on a large collection of rendered synthetic objects. This is supervised pretraining: our model learns a rich vocabulary for object shape and texture, preparing it for real-world reconstruction. Next is mid-training with semi-synthetic data produced by pasting rendered models into natural images. Finally, post-training adapts the model to real images, using both a novel model-in-the-loop (MITL) pipeline and human 3D artists, and aligns it to human preference. We find that synthetic pretraining generalizes, given adequate post-training on natural images."
The repository is GitHub – facebookresearch/sam-3d-objects: SAM 3D Objects, and the project site is SAM 3D

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation [98.5]
We present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images.
Reference translation of the paper (metadata) (Fri, 14 Nov 2025 16:02:38 GMT)
"WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context." A proposed benchmark for multi-turn generation. The evaluation method is: "we employ a key-point-based scoring approach using structured evaluation criteria."
NanoBanana scores very high (although it does not appear to be the latest version).
The project site is Weave
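
To give a feel for what a key-point-based score might compute, here is a minimal sketch in which each task lists weighted key points, a judge marks each as satisfied or not, and the score is the weighted fraction satisfied. The excerpt does not describe the actual WEAVEBench criteria or aggregation, so the `(weight, satisfied)` representation and the function `key_point_score` are hypothetical.

```python
def key_point_score(judgements):
    """Weighted fraction of satisfied key points, in the range [0, 1].

    judgements: list of (weight, satisfied) pairs, where weight is a
    positive number and satisfied is a bool decided by a judge.
    """
    total_weight = sum(weight for weight, _ in judgements)
    if total_weight <= 0:
        raise ValueError("at least one key point with positive weight is required")
    earned = sum(weight for weight, satisfied in judgements if satisfied)
    return earned / total_weight


# Hypothetical judgements for one multi-turn editing task.
example = [
    (2.0, True),   # e.g. the object edited in turn 1 is preserved in turn 3
    (1.0, False),  # e.g. the requested background change was applied
    (1.0, True),   # e.g. the style stays consistent across turns
]
score = key_point_score(example)
print(score)
```

Scoring against explicit key points makes it possible to see which part of the dialogue history a model failed to respect, rather than only a single overall rating.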
