2024年10月22日 – arXiv最新論文の紹介

Jailbreaking LLM-Controlled Robots

Jailbreaking LLM-Controlled Robots [82.0]
大規模言語モデル(LLM)は、文脈推論と直感的な人間とロボットの相互作用を可能にすることによって、ロボット工学の分野に革命をもたらした。 LLMは脱獄攻撃に弱いため、悪意のあるプロンプトはLLMの安全ガードレールをバイパスすることで有害なテキストを誘発する。 LLM制御ロボットをジェイルブレイクするアルゴリズムであるRoboPAIRを紹介する。
論文参考訳（メタデータ） (Thu, 17 Oct 2024 15:55:36 GMT)
LLMが制御するロボットに対する脱獄攻撃、「(i) a white-box setting, wherein the attacker has full access to the NVIDIA Dolphins self-driving LLM, (ii) a gray-box setting, wherein the attacker has partial access to a Clearpath Robotics Jackal UGV robot equipped with a GPT-4o planner, and (iii) a black-box setting, wherein the attacker has only query access to the GPT-3.5-integrated Unitree Robotics Go2 robot dog. 」を設定、「In each scenario and across three new datasets of harmful robotic actions, we demonstrate that ROBOPAIR, as well as several static baselines, finds jailbreaks quickly and effectively, often achieving 100% attack success rates.」とのこと。。大きな脅威になりうる。
プロジェクトサイトはRoboPAIR

DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception [16.3]
速度の優位性を保ちながら精度を向上させる新しいアプローチであるDoc-YOLOを導入する。堅牢な文書事前学習には、Mesh-candidate BestFitアルゴリズムを導入する。モデル最適化の観点からは,グローバルからローカライズ可能な受信モジュールを提案する。
論文参考訳（メタデータ） (Wed, 16 Oct 2024 14:50:47 GMT)
多様なレイアウトデータを合成する手法、Mesh-candidate BestFit methodologyの提案とそれを用いた高速高性能なDocLayout-YOLOの提案。
リポジトリはGitHub – opendatalab/DocLayout-YOLO: DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception

Latent Action Pretraining from Videos [156.9]
一般行動モデル(LAPA)のための潜在行動事前訓練について紹介する。 LAPA(英: LAPA)は、VLA(Vision-Language-Action)モデルに接地型ロボットアクションラベルを含まない教師なしの訓練方法である。本稿では,ロボットアクションラベルを持たないインターネット規模のビデオから学習する手法を提案する。
論文参考訳（メタデータ） (Tue, 15 Oct 2024 16:28:09 GMT)
インターネットにあるようなビデオデータからVLAを構築する手法の提案、「Across three benchmarks spanning both simulation and real-world robot experiments, we show that our method significantly improves transfer to downstream tasks compared to existing approaches.」とのこと
プロジェクトサイトはLAPA

MRAG-Bench: Vision-Centric Evaluation for Retrieval-Augmented Multimodal Models [115.2]
MRAG-Benchというマルチモーダル検索拡張生成ベンチマークを導入する。 MRAG-Benchは16,130枚の画像と1,353個の人間による複数の質問からなる。その結果,すべての大規模視覚言語モデル (LVLM) は,テキスト知識と比較して画像で拡張すると改善が見られた。
論文参考訳（メタデータ） (Thu, 10 Oct 2024 17:55:02 GMT)
マルチモーダルなRAGのベンチマーク、様々なモデルのスコア一覧表もとても参考になる。
リポジトリはMRAG-Bench (mragbench.github.io)

MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models [49.8]
近年,Med-LVLM (Med-LVLMs) の進歩により,対話型診断ツールの新たな可能性が高まっている。 Med-LVLMは、しばしば事実の幻覚に悩まされ、誤った診断につながることがある。我々は,Med-LVLMの現実性を高めるために,多目的マルチモーダルRAGシステムMMed-RAGを提案する。
論文参考訳（メタデータ） (Wed, 16 Oct 2024 23:03:27 GMT)
医療ドメイン、かつ、マルチモーダルなRAGシステムの提案。ドメインを判別してRetireverを使い分けるなど凝った構成。「These enhancements significantly boost the factual accuracy of Med-LVLMs.」とのことで、この手の工夫は重要。
リポジトリはGitHub – richard-peng-xia/MMed-RAG: [arXiv’24 & NeurIPSW’24] MMed-RAG: Versatile Multimodal RAG System for Medical Vision Language Models