Test-Time Computing for Referring Multimodal Large Language Models

Test-Time Computing for Referring Multimodal Large Language Models [143.5]
We therefore propose ControlMLLM++, a novel test-time adaptation framework that injects learnable visual prompts into a frozen multimodal large language model.
Reference translation of the paper (metadata) (Mon, 23 Feb 2026 04:42:10 GMT)
The paper states: "We introduce ControlMLLM++, a novel test-time latent variable optimization framework that injects explicit visual prompts into frozen pre-trained MLLMs to enable referring capabilities without additional training." The approach is described as: "ControlMLLM++ falls into this category, performing test-time optimization of latent perturbations to visual tokens to steer attention maps towards the referred region r."
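To make the quoted idea concrete, here is a minimal toy sketch of test-time optimization of a latent perturbation. It is not the ControlMLLM++ implementation. It assumes a single fixed query vector and dot-product softmax attention over a handful of visual tokens, and it uses the attention mass inside the referred region as the objective. All names (`optimize_perturbation`, `region_mass`, `mask`), the step count, and the learning rate are hypothetical choices for illustration. The "frozen model" is represented by `query` and `tokens`, which are never modified; only the additive perturbation is updated, by gradient ascent with a hand-derived gradient.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, tokens, perturb):
    """Attention of one query over visual tokens shifted by a latent perturbation."""
    scores = [
        sum(q * (t + p) for q, t, p in zip(query, tok, per))
        for tok, per in zip(tokens, perturb)
    ]
    return softmax(scores)


def region_mass(attn, mask):
    """Total attention that falls on tokens inside the referred region."""
    return sum(a for a, inside in zip(attn, mask) if inside)


def optimize_perturbation(query, tokens, mask, steps=50, lr=0.5):
    """Gradient ascent on the region attention mass w.r.t. the perturbation only.

    For E = sum of attention inside the region and scores s_j = q . (v_j + p_j),
    dE/ds_j = a_j * (1[j in region] - E), hence dE/dp_j = dE/ds_j * q.
    """
    perturb = [[0.0] * len(query) for _ in tokens]
    for _ in range(steps):
        attn = attention(query, tokens, perturb)
        mass = region_mass(attn, mask)
        for j, a in enumerate(attn):
            coeff = a * ((1.0 if mask[j] else 0.0) - mass)
            for k, qk in enumerate(query):
                perturb[j][k] += lr * coeff * qk
    return perturb


# Toy inputs (assumed values): four 2-d "visual tokens", region = token index 2.
query = [1.0, 0.5]
tokens = [[0.2, 0.1], [0.9, 0.4], [0.1, 0.3], [0.5, 0.5]]
mask = [False, False, True, False]

zero = [[0.0, 0.0] for _ in tokens]
before = region_mass(attention(query, tokens, zero), mask)
perturb = optimize_perturbation(query, tokens, mask)
after = region_mass(attention(query, tokens, perturb), mask)
print(f"attention mass on region: before={before:.3f}, after={after:.3f}")
```

In a real MLLM the gradient would come from automatic differentiation through the frozen network rather than a closed-form expression, and the objective would be computed from the model's actual attention maps; the sketch only shows the shape of the loop.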
The repository is GitHub – mrwu-mac/ControlMLLM: [NeurIPS2024] Repo for the paper 'ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models'
