{"id":7809,"date":"2025-11-25T06:42:00","date_gmt":"2025-11-24T21:42:00","guid":{"rendered":"https:\/\/devneko.jp\/wordpress\/?p=7809"},"modified":"2025-11-23T08:55:10","modified_gmt":"2025-11-22T23:55:10","slug":"think-visually-reason-textually-vision-language-synergy-in-arc-arc-is-a-vision-problem","status":"publish","type":"post","link":"https:\/\/devneko.jp\/wordpress\/?p=7809","title":{"rendered":"Think Visually, Reason Textually: Vision-Language Synergy in ARC\u00a0\/ ARC Is a Vision Problem!"},"content":{"rendered":"\n<ul class=\"wp-block-list\">\n<li><strong>Think Visually, Reason Textually: Vision-Language Synergy in ARC\u00a0<\/strong>[94.2]<br>ARC-AGI is a rigorous testbed for inducing conceptual rules and transferring them to novel tasks. The authors render ARC-AGI grids natively as images, although performance with images degrades due to imprecise rule execution. They introduce two synergistic strategies: VLSR (Vision-Language Synergy Reasoning), which decomposes ARC-AGI into modality-aligned subtasks, and MSSC (Modality-Switch Self-Correction), which uses vision to verify text-based reasoning for intrinsic error correction.<br><a href=\"http:\/\/arxiv.org\/abs\/2511.15703v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2511.15703v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Wed, 19 Nov 2025 18:59:04 GMT)<\/li>\n\n\n\n<li>The paper observes: \u300cOur analysis of the OpenAI o4-mini model reveals striking differences: vision excels at rule summarization, providing a 3.0% improvement through its holistic perception of 2D spatial structures, while text excels at rule application, with vision causing a dramatic 20.5% performance drop due to imprecise element-wise manipulation. These findings demonstrate that the question is not whether to use vision or text, but rather when and how to strategically combine them.\u300d It also reports: \u300cBy fine-tuning separate models for visual rule summarization and textual rule application, our approach achieves a 3.5% improvement over text-only fine-tuning on the same training data, enabling small open-source models (Qwen3-8B) to surpass closed-source models like GPT-4o.\u300d<\/li>\n<\/ul>\n\n\n\n<ul class=\"wp-block-list\">\n<li><strong>ARC Is a Vision Problem!\u00a0<\/strong>[50.6]<br>The authors formulate ARC within a vision paradigm, framing it as an image-to-image translation problem. Their framework, Vision ARC, achieves 60.4% accuracy on the ARC-1 benchmark.<br><a href=\"http:\/\/arxiv.org\/abs\/2511.14761v1\">Paper<\/a>\u00a0\u00a0<a href=\"https:\/\/fugumt.com\/fugumt\/paper_check\/2511.14761v1\">Reference translation (metadata)<\/a>\u00a0 \u00a0(Tue, 18 Nov 2025 18:59:49 GMT)<\/li>\n\n\n\n<li>As the title suggests, this work tackles ARC as a vision problem and achieves a high score: \u300calthough the puzzle-like tasks in ARC are inherently visual, existing research has rarely approached the problem from a vision-centric perspective. In this work, we formulate ARC within a vision paradigm, framing it as an image-to-image translation problem.\u300d<\/li>\n\n\n\n<li>The project site is <a href=\"https:\/\/github.com\/lillian039\/VARC\">GitHub &#8211; lillian039\/VARC<\/a><\/li>\n\n\n\n<li>\u300cIt is natural to explore vision driven approaches for ARC. On the other hand, human reasoning is not confined to language or vision in isolation, but instead should integrate information across modalities. With our complementary vision-based perspective, we hope the scope of abstract reasoning will be further broadened.\u300d I think this point is exactly right. It echoes the observation in <a href=\"https:\/\/devneko.jp\/wordpress\/?p=7690\">Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark \u2013 arXiv\u6700\u65b0\u8ad6\u6587\u306e\u7d39\u4ecb<\/a>. My sense is that once impressive capabilities such as NanoBanana's are well integrated, we will move closer to AGI.<\/li>\n<\/ul>\n","protected":false},"excerpt":{"rendered":"","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[2],"tags":[232,434],"class_list":["post-7809","post","type-post","status-publish","format-standard","hentry","category-arxiv","tag-lrm","tag-vision-language"],"_links":{"self":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7809","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/d
evneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=7809"}],"version-history":[{"count":1,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7809\/revisions"}],"predecessor-version":[{"id":7810,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=\/wp\/v2\/posts\/7809\/revisions\/7810"}],"wp:attachment":[{"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=7809"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=7809"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/devneko.jp\/wordpress\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=7809"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}