The following suggestion for improvement is interesting: "First, integrating dedicated symbolic modules is essential: the LogiCAM outperforms base VLMs precisely because it extracts multimodalities based on logic and embeds explicit symbolic reasoning steps. Second, existing VLMs struggle to align and fuse visual and textual information when performing formal logic; Future work should explore tighter multimodal integration, such as cross-modal architectures trained with logic-grounded objectives, to bridge this gap." It suggests that current models struggle with formal processing.