Action – arXiv最新論文の紹介

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback

LAM SIMULATOR: Advancing Data Generation for Large Action Model Training via Online Exploration and Trajectory Feedback [121.8]
AIエージェントのための大規模アクションモデル(LAM)は、素晴らしいポテンシャルを提供するが、高品質なトレーニングデータを必要とするため、課題に直面している。 LAM SIMULATORは,高品質なフィードバックによるエージェントタスクのオンライン探索を目的とした総合的なフレームワークである。本フレームワークは,動的タスククエリジェネレータ,広範囲なツールコレクション,および大規模言語モデル(LLM)エージェントがツールを呼び出し,リアルタイムフィードバックを受信できる対話型環境を備えている。
論文参考訳（メタデータ） (Mon, 02 Jun 2025 22:36:02 GMT)
LAM SIMULATOR, a comprehensive frame- work designed for online exploration of agentic tasks with high-quality feedback

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [134.0]
汎用ロボットには多目的体と知的な心が必要だ。近年のヒューマノイドロボットの進歩は、汎用的な自律性を構築するためのハードウェアプラットフォームとして大きな可能性を秘めている。我々はヒューマノイドロボットのオープン基盤モデルであるGR00T N1を紹介する。
論文参考訳（メタデータ） (Tue, 18 Mar 2025 21:06:21 GMT)
NVIDIAによるヒューマノイドロボットをターゲット（「GR00T N1, an open foundation model for generalist humanoid robots」）としたVison-Language-Actionモデルの提案。「We design a compositional model that integrates Vision-Language Model (VLM)-based reasoning module (System 2) and Diffusion Transformer (DiT)-based action module (System 1) in a unified learning framework;」という構成。
リポジトリはGitHub – NVIDIA/Isaac-GR00T: NVIDIA Isaac GR00T N1 is the world’s first open foundation model for generalized humanoid robot reasoning and skills.、nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim · Datasets at Hugging Face

TACO: Learning Multi-modal Action Models with Synthetic Chains-of-Thought-and-Action [103.6]
複雑・多段階・多モードタスクの性能向上を目的とした多モード大規模アクションモデルであるTACOを提案する。推論中、TACOはチェーン・オブ・シント・アンド・アクション(CoTA)を生成し、OCR、深さ推定、電卓などの外部ツールを呼び出すことで中間ステップを実行する。このデータセットにより、TACOは複雑な推論とアクションパスを学習し、直接回答だけでチューニングデータに基づいてトレーニングされた既存のモデルを上回ることができる。
論文参考訳（メタデータ） (Sat, 07 Dec 2024 00:42:04 GMT)
「Our TACO model is able to output a Chain-of Thought-and-Action (CoTA) and answer challenging questions based on the thoughts and action outputs」というモデルの提案。マルチモーダルなAction付きのモデル。GPT-4oなどを使って構築した合成データを活用とのこと。
プロジェクトサイトはTACO

OS-ATLAS: A Foundation Action Model for Generalist GUI Agents [55.4]
OS-AtlasはGUIグラウンディングとOODエージェントタスクに優れた基礎的なGUIアクションモデルである。現在までに1300万以上のGUI要素を含む、オープンソースのクロスプラットフォームGUI基盤コーパスをリリースしています。
論文参考訳（メタデータ） (Wed, 30 Oct 2024 17:10:19 GMT)
GUIを対象としたFoundation Action Modelの提案、Anthropicの発表もあって盛り上がっている領域。性能は「although GPT-4o with OS-Atlas-Base as the grounding module still lags behind human performance, it significantly outperforms other grounding methods such as SeeClick and Set-of-Mark (SoM)」とのこと。
リポジトリはOS-Atlas Homepage

Latent Action Pretraining from Videos [156.9]
一般行動モデル(LAPA)のための潜在行動事前訓練について紹介する。 LAPA(英: LAPA)は、VLA(Vision-Language-Action)モデルに接地型ロボットアクションラベルを含まない教師なしの訓練方法である。本稿では,ロボットアクションラベルを持たないインターネット規模のビデオから学習する手法を提案する。
論文参考訳（メタデータ） (Tue, 15 Oct 2024 16:28:09 GMT)
インターネットにあるようなビデオデータからVLAを構築する手法の提案、「Across three benchmarks spanning both simulation and real-world robot experiments, we show that our method significantly improves transfer to downstream tasks compared to existing approaches.」とのこと
プロジェクトサイトはLAPA