Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding

Cognitive Mismatch in Multimodal Large Language Models for Discrete Symbol Understanding [96.8]
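This paper introduces a benchmark for evaluating how top-tier MLLMs navigate discrete semantic spaces. The models often fail at basic symbol recognition, yet succeed at complex reasoning tasks. The work provides a roadmap for developing more rigorous, human-oriented intelligent systems.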
Reference translation of the paper (metadata) (Thu, 19 Mar 2026 04:08:20 GMT)
The paper makes an interesting observation: "despite impressive reasoning capabilities, current models frequently fail at foundational visual symbol grounding, relying instead on linguistic priors, procedural imitation, or memorized patterns. Our findings challenge a prevailing assumption in multimodal intelligence that visual recognition is inherently simpler than reasoning. Instead, we observe a consistent recognition-reasoning inversion phenomenon, where higher-level reasoning performance often masks deficiencies in low-level symbolic perception. This phenomenon underscores a key limitation of existing training paradigms: while models excel at leveraging large-scale continual natural images, they struggle to construct stable, compositional visual representations of abstract, discrete symbols."
