SpatialTree: How Spatial Abilities Branch Out in MLLMs
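We introduce a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Extended deliberation helps complex reasoning but impairs intuitive perception. We propose a simple self-thinking strategy that suppresses unnecessary deliberation. Reference translation of the paper (metadata) (Tue, 23 Dec 2025 18:59:46 GMT)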
The paper states: "Spatial abilities refer to the capacity to perceive, understand, reason about, and interact with 2D and 3D space, a long-standing topic in cognitive science [13, 45, 48]. In multimodal large language models (MLLMs), these abilities form the cornerstone of Spatial Intelligence (SI), yet remain challenging to study systematically due to their inherent complexity and broad scope [31, 63]." On that basis, the authors build a benchmark to measure spatial abilities. The four levels are as follows.
L1 Perception: This level focuses on native perception of space, capturing raw geometric and physical attributes such as size, distance, and motion, without relying on language or symbolic reasoning.
L2 Mental Mapping: This level maps spatial perception to language, grounding spatial concepts in linguistic semantics and forming language-structured spatial memory.
L3 Mental Simulation: This level supports internal reasoning about space, enabling mental simulation, including causal reasoning about dynamics, relational and geometric problem solving, and sequential planning for actions and navigation.
L4 Spatial Agent: This level executes actions in space, integrating perception, language, and reasoning to interact with the environment, interpret feedback, and complete long-horizon spatial tasks.
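As a rough illustration of how a benchmark organized around these four levels might report results, the sketch below models the hierarchy as an ordered enum and averages correctness per level. The names `SpatialLevel` and `per_level_accuracy`, the record format, and the toy data are all hypothetical and are not taken from the paper or its code.

```python
from collections import defaultdict
from enum import IntEnum
from statistics import mean


class SpatialLevel(IntEnum):
    """The four levels of the hierarchy, ordered from low to high."""
    PERCEPTION = 1         # L1: raw geometric and physical attributes
    MENTAL_MAPPING = 2     # L2: spatial perception grounded in language
    MENTAL_SIMULATION = 3  # L3: internal reasoning and planning
    SPATIAL_AGENT = 4      # L4: acting in an environment


def per_level_accuracy(results):
    """Average correctness per level.

    `results` is an iterable of (SpatialLevel, bool) pairs. This record
    format is an assumption, not the benchmark's actual schema.
    """
    buckets = defaultdict(list)
    for level, correct in results:
        buckets[level].append(1.0 if correct else 0.0)
    # Sort by level so the report reads from L1 upward.
    return {level: mean(scores) for level, scores in sorted(buckets.items())}


# Made-up toy records, for illustration only.
toy_results = [
    (SpatialLevel.MENTAL_SIMULATION, True),
    (SpatialLevel.PERCEPTION, True),
    (SpatialLevel.PERCEPTION, False),
]
accuracy = per_level_accuracy(toy_results)
```

Reporting scores per level rather than as a single aggregate makes it possible to see when a change improves one level (for example, L3 reasoning) while degrading another (for example, L1 perception).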