RoboOmni: Proactive Robot Manipulation in Omni-modal Context

RoboOmni: Proactive Robot Manipulation in Omni-modal Context [165.1]
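We introduce cross-modal contextual instructions, a setting in which intent is derived from spoken dialogue, environmental sounds, and visual cues. We propose RoboOmni, a framework based on end-to-end omni-modal LLMs that unifies intention recognition, interaction confirmation, and action execution. In experiments in both simulation and real-world settings, RoboOmni surpasses text-based and ASR-based baselines in success rate, inference speed, intention recognition, and proactive assistance.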
Reference translation of the paper (metadata) (Mon, 27 Oct 2025 18:49:03 GMT)
The paper poses the question "There arises a key research question: Can a robot integrate cross-modal context, including speech, environmental audio, and visual observations, to proactively infer and verify user intent?" and answers it with a multimodal model: "we propose RoboOmni, an end-to-end omni-modal framework for manipulation that closes the loop of intent recognition, interaction confirmation, and action execution. Unlike prior approaches, RoboOmni supports direct speech interaction without ASR, infers latent commands by fusing human speech, environmental audio, and vision through spatiotemporal modeling, and verifies intent via interaction."
The project site is RoboOmni: Proactive Robot Manipulation in Omni-modal Context
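
To make the quoted loop of intent recognition, interaction confirmation, and action execution concrete, here is a toy sketch. It is not RoboOmni's implementation: the `Observation` class, `infer_intent`, `run_loop`, and the hand-written rule are all hypothetical names invented for illustration. The cues are represented as plain strings, whereas the paper describes working on raw speech without ASR through an end-to-end model.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Observation:
    """Hypothetical container for the three context channels (strings as placeholders)."""
    speech: str         # placeholder for spoken dialogue; a real system would use raw audio
    ambient_sound: str  # placeholder for environmental audio, e.g. "kettle whistling"
    visual_cue: str     # placeholder for what the camera sees, e.g. "empty cup on table"


def infer_intent(obs: Observation) -> Optional[str]:
    """Toy intent recognition: a hand-written rule standing in for a learned model."""
    if "thirsty" in obs.speech.lower() and "cup" in obs.visual_cue.lower():
        return "pour water into the cup"
    return None  # no latent command inferred from this context


def run_loop(obs: Observation, ask_user: Callable[[str], bool]) -> str:
    """Recognize intent, confirm it with the user, then 'execute' (here: return a string)."""
    intent = infer_intent(obs)
    if intent is None:
        return "no action"
    # Interaction confirmation: the robot asks before acting on an inferred intent.
    if not ask_user(f"Should I {intent}?"):
        return "declined"
    return f"executing: {intent}"


if __name__ == "__main__":
    obs = Observation("I'm so thirsty", "quiet room", "empty cup on table")
    print(run_loop(obs, ask_user=lambda question: True))
```

The confirmation step sits between inference and execution so that a wrongly inferred intent costs one question rather than one unwanted action.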