「Our analysis of the OpenAI o4-mini model reveals striking differences: vision ex- cels at rule summarization, providing a 3.0% improvement through its holistic perception of 2D spatial structures, while text excels at rule application, with vision causing a dramatic 20.5% performance drop due to imprecise element-wise manipulation. These findings demonstrate that the question is not whether to use vision or text, but rather when and how to strategically combine them.」という指摘と、「By fine-tuning separate models for visual rule summarization and textual rule application, our approach achieves a 3.5% improvement over text-only fine-tuning on the same training data, enabling small open-source models (Qwen3-8B) to surpass closed-source models like GPT-4o.」とのこと。
ARC Is a Vision Problem! [50.6] 視覚パラダイム内のARCを画像から画像への変換問題として定義する。 私たちのフレームワークであるVision ARCは、ARC-1ベンチマークで60.4%の精度を実現しています。 論文参考訳(メタデータ) (Tue, 18 Nov 2025 18:59:49 GMT)
こちらは論文名の通り、「although the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem.」とVisionの問題として解いて高スコアを達成。
「It is natural to explore vision driven approaches for ARC. On the other hand, human reasoning is not confined to language or vision in isolation, but instead should integrate information across modalities. With our complementary vision-based perspective, we hope the scope of abstract reasoning will be further broadened.」との指摘はその通りだと思う。Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark – arXiv最新論文の紹介のような指摘。NanoBananaの印象的な性能などうまく統合されていくとAGIに近づいていくんだろうなという感覚がある。