November 7, 2025 – Recent arXiv papers

RoboOmni: Proactive Robot Manipulation in Omni-modal Context

RoboOmni: Proactive Robot Manipulation in Omni-modal Context [165.1]
We introduce cross-modal contextual instructions, in which intent is derived from spoken dialogue, environmental sounds, and visual cues. We propose RoboOmni, a framework based on an end-to-end omni-modal LLM that unifies intent recognition, interaction confirmation, and action execution. In experiments in both simulation and real-world settings, RoboOmni outperforms text-based and ASR-based baselines in success rate, inference speed, intent recognition, and proactive assistance.
Reference translation of the paper (metadata) (Mon, 27 Oct 2025 18:49:03 GMT)
In response to the question "There arises a key research question: Can a robot integrate cross-modal context, including speech, environmental audio, and visual observations, to proactively infer and verify user intent?", the authors present a multimodal model: "we propose RoboOmni, an end-to-end omni-modal framework for manipulation that closes the loop of intent recognition, interaction confirmation, and action execution. Unlike prior approaches, RoboOmni supports direct speech interaction without ASR, infers latent commands by fusing human speech, environmental audio, and vision through spatiotemporal modeling, and verifies intent via interaction."
Project site: RoboOmni: Proactive Robot Manipulation in Omni-modal Context

Towards Stable and Effective Reinforcement Learning for Mixture-of-Experts [113.1]
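We propose a new router-aware method for optimizing importance-sampling weights in off-policy reinforcement learning (RL). Specifically, we design a rescaling strategy guided by router logits, which effectively reduces gradient variance and mitigates training instability. Experiments show that the method significantly improves both convergence stability and the final performance of MoE models.
Reference translation of the paper (metadata) (Mon, 27 Oct 2025 05:47:48 GMT)
For reinforcement learning on MoE models, the paper proposes "Router-Shift Policy Optimization (RSPO), an RL algorithm specifically designed for MoE architectures to achieve stable and efficient training."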
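To make the idea of router-guided rescaling more concrete, here is a minimal sketch of how an importance-sampling weight could be damped when the router's expert probabilities drift between the old and new policies. It is not the RSPO formulation from the paper: the function names, the use of the mean absolute log-probability shift, the `floor` value, and the PPO-style clipping are all assumptions chosen for illustration.

```python
import math

def router_shift_weight(router_logp_new, router_logp_old, floor=0.8):
    """Hypothetical damping factor in [floor, 1].

    Inputs are the router log-probabilities of one token's selected
    experts under the new and old policies. The larger the drift,
    the smaller the returned weight (assumed form, not from the paper).
    """
    shift = sum(abs(n - o) for n, o in zip(router_logp_new, router_logp_old))
    shift /= len(router_logp_new)
    return max(math.exp(-shift), floor)

def rescaled_objective(logp_new, logp_old, advantage,
                       router_logp_new, router_logp_old,
                       clip_eps=0.2, floor=0.8):
    """Per-token clipped surrogate with a router-aware rescaled IS weight."""
    # Standard importance-sampling ratio between new and old policy.
    ratio = math.exp(logp_new - logp_old)
    # Rescale the ratio by the router-drift factor (assumption).
    scaled = router_shift_weight(router_logp_new, router_logp_old, floor) * ratio
    # PPO-style clipping applied to the rescaled ratio.
    clipped = min(max(scaled, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(scaled * advantage, clipped * advantage)

# A token whose routing did not change keeps its full weight.
unchanged = rescaled_objective(-1.0, -1.0, 2.0, [-0.5, -1.0], [-0.5, -1.0])
# A token whose routing shifted strongly is down-weighted to the floor.
shifted = rescaled_objective(-1.0, -1.0, 2.0, [0.0], [-10.0])
```

In a real training loop the damping factor would typically be treated as a constant with respect to gradients, so that it only rescales each token's contribution rather than adding a new optimization target.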

Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks [33.7]
We introduce Dream4Drive, a new synthetic data generation framework for enhancing downstream perception tasks. Dream4Drive decomposes the input video into multiple 3D-aware guidance maps and renders 3D assets onto these guidance maps. The driving world model is then fine-tuned to produce edited multi-view videos that can be used to train downstream perception models.
Reference translation of the paper (metadata) (Fri, 24 Oct 2025 10:10:43 GMT)
The paper proposes a data synthesis framework: "We propose Dream4Drive, a 3D-aware synthetic data generation framework that edits the video with dense guidance maps, producing synthetic data with diverse appearances and geometric consistency."
Project site: Rethinking Driving World Model as Synthetic Data Generator for Perception Tasks