Action Planning – Introduction to the Latest arXiv Papers

Distilling On-device Language Models for Robot Planning with Minimal Human Intervention

Distilling On-device Language Models for Robot Planning with Minimal Human Intervention [117.9]
PRISM is a framework for distilling small language model (SLM)-enabled robot planners. PRISM is applied to three LLM-enabled planners covering mapping, exploration, manipulation, and household assistance. The authors show that PRISM improves the performance of Llama-3.2-3B from 10-20% of GPT-4o's performance to over 93%.
Reference translation of the paper (metadata) (Fri, 20 Jun 2025 21:44:27 GMT)
A proposal for a distillation framework targeting robot planning: "Given a source LLM-enabled planner, PRISM synthesizes tasks and environments, elicits plans from the LLM-enabled planner in these synthesized environments, and then uses the resulting data to train an SLM-enabled planner that serves as a drop-in replacement for the source model." Intuitively it seems effective, and the reported results are indeed promising.
Project site: PRISM
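The data-collection loop described in the quote above (synthesize tasks, elicit plans from the source planner, store the pairs for training a smaller model) can be sketched roughly as follows. This is a toy illustration only, not the PRISM implementation: `synthesize_tasks`, `teacher_planner`, and `collect_distillation_data` are hypothetical names, the "teacher" is a rule-based stand-in for an LLM-enabled planner, and the fine-tuning step itself is omitted.

```python
import random
from typing import Callable, Dict, List

Task = Dict[str, str]


def synthesize_tasks(n: int, seed: int = 0) -> List[Task]:
    """Hypothetical stand-in for task/environment synthesis."""
    rng = random.Random(seed)
    objects = ["cup", "book", "box"]
    rooms = ["kitchen", "office", "hallway"]
    tasks = []
    for _ in range(n):
        obj, room = rng.choice(objects), rng.choice(rooms)
        tasks.append({"object": obj, "room": room,
                      "goal": f"bring the {obj} to the {room}"})
    return tasks


def teacher_planner(task: Task) -> List[str]:
    """Placeholder for the source LLM-enabled planner (rule-based here)."""
    return [f"find {task['object']}", f"pick up {task['object']}",
            f"go to {task['room']}", f"put down {task['object']}"]


def collect_distillation_data(
    tasks: List[Task], planner: Callable[[Task], List[str]]
) -> List[Dict[str, str]]:
    """Elicit a plan per synthesized task and store it as a training pair."""
    return [{"prompt": t["goal"], "completion": " -> ".join(planner(t))}
            for t in tasks]


dataset = collect_distillation_data(synthesize_tasks(5), teacher_planner)
for example in dataset:
    print(example)
```

The resulting prompt/completion pairs are the kind of data one would then use to fine-tune the small model; how that training is actually done is left to the paper.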

Visual Planning: Let’s Think Only with Images

Visual Planning: Let’s Think Only with Images [30.7]
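We argue that language is not always the most natural or effective modality for reasoning, especially in tasks involving spatial and geometric information. This work therefore proposes a new paradigm, Visual Planning, which enables planning through purely visual representations, independent of text. In this paradigm, planning is executed via a sequence of images that encode step-by-step inference in the visual domain.
Reference translation of the paper (metadata) (Fri, 16 May 2025 16:17:22 GMT)
The authors state: "By enabling models to operate entirely through visual state transitions without textual mediation, we demonstrate that purely visual representations can lead to more effective and intuitive planning". Text is powerful but not all-purpose, and it is convincing that images can be more effective at the planning level for some tasks. Very interesting. As I also felt with GRIT, approaches that harness the power of images look very promising.
Repository: GitHub – yix8/VisualPlanning: Visual Planning: Let’s Think Only with Images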

GRIT: Teaching MLLMs to Think with Images [22.7]
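Grounded Reasoning with Images and Texts (GRIT) is a novel method for training MLLMs to think with images. GRIT generates reasoning chains that interleave natural language with explicit bounding box coordinates. GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets.
Reference translation of the paper (metadata) (Wed, 21 May 2025 17:54:49 GMT)
Project site: GRIT: Teaching MLLMs to Think with Images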

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking

HyperTree Planning: Enhancing LLM Reasoning via Hierarchical Thinking [109.1]
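The proposed HyperTree Planning (HTP) is a novel reasoning paradigm that constructs hypertree-structured planning outlines. Experiments demonstrate the effectiveness of HTP, which achieves state-of-the-art accuracy on the TravelPlanner benchmark with Gemini-1.5-Pro, a 3.6x performance improvement over o1-preview.
Reference translation of the paper (metadata) (Mon, 05 May 2025 02:38:58 GMT)
A proposal for action planning using a HyperTree, characterized as follows: "Compared to previous tree planning methods such as ToT (Yao et al., 2024) and RAP (Hao et al., 2023), HTP introduces structural innovations that enable each edge to connect multiple child nodes, making it suitable for a divide-and-conquer strategy."
It appears to be highly effective. It is surely a more powerful structure than an ordinary tree, but the interesting point is that LLMs also seem to handle it easily. Perhaps it resembles natural language (which allows many ways of writing the same thing)?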
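To make the "each edge connects multiple child nodes" idea concrete, here is a minimal sketch of a hypertree-like data structure. It is an illustration under my own assumptions, not the data structure from the paper: the `Node` class, the `expand` function, and the travel example are all hypothetical. Each hyperedge represents one way of decomposing a node into a group of sub-tasks that must all be solved (divide and conquer), and a node may have several alternative hyperedges.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    label: str
    # Each hyperedge links this node to a *group* of children that must
    # all be solved; multiple hyperedges are alternative decompositions.
    hyperedges: List[List["Node"]] = field(default_factory=list)


def expand(node: Node, choice: int = 0) -> List[str]:
    """Return the leaf sub-tasks reached by following one hyperedge per node.

    `choice` selects which alternative to follow; it is clamped to the
    last available hyperedge when a node has fewer alternatives.
    """
    if not node.hyperedges:
        return [node.label]
    idx = min(choice, len(node.hyperedges) - 1)
    leaves: List[str] = []
    for child in node.hyperedges[idx]:
        leaves.extend(expand(child, choice))
    return leaves


transport = Node("transport", [[Node("book flight")], [Node("book train")]])
trip = Node("plan trip", [[transport, Node("lodging"), Node("dining")]])

print(expand(trip, choice=0))  # ['book flight', 'lodging', 'dining']
print(expand(trip, choice=1))  # ['book train', 'lodging', 'dining']
```

In an ordinary tree each edge leads to a single child, so the grouping "transport AND lodging AND dining" would have to be encoded some other way; here it is a single hyperedge.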

Self-Steering Language Models

Self-Steering Language Models [114.0]
DisCIPL is a method for "self-steering" language models. DisCIPL uses a Planner model to generate a task-specific inference program. Our work opens up a design space of highly parallelized Monte Carlo inference strategies.
Reference translation of the paper (metadata) (Wed, 09 Apr 2025 17:54:22 GMT)
An introduction to the following approach: "This paper introduces DISCIPL, a method for “self-steering” LMs where a Planner model generates a task-specific inference program that is executed by a population of Follower models."
The claim that "By decomposing reasoning into planning and execution, our architecture preserves flexibility while enabling orchestration of highly efficient, parallel search patterns." also seems plausible from experience. It is good to see that the validation is thorough.
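The split between a Planner that writes an inference program and Followers that execute it can be sketched as below. This is a deliberately simplified toy, not DisCIPL itself: the "program" here is plain rejection sampling, whereas the paper describes Monte Carlo inference strategies, and `planner`, `make_follower`, and the word-count constraint are hypothetical stand-ins for LLM calls.

```python
import random
from typing import Callable, List, Optional

Follower = Callable[[random.Random], str]
Program = Callable[[List[Follower], random.Random], Optional[str]]


def planner(max_words: int) -> Program:
    """Hypothetical Planner: returns a task-specific inference program.

    The emitted program asks each follower for a candidate and returns
    the first one that satisfies the constraint (rejection sampling).
    """
    def program(followers: List[Follower], rng: random.Random) -> Optional[str]:
        for follower in followers:
            candidate = follower(rng)
            if len(candidate.split()) <= max_words:
                return candidate
        return None  # no follower produced a valid candidate
    return program


def make_follower(vocab: List[str]) -> Follower:
    """Hypothetical Follower: a random text generator standing in for a small LM."""
    def follower(rng: random.Random) -> str:
        length = rng.randint(1, 6)
        return " ".join(rng.choice(vocab) for _ in range(length))
    return follower


followers = [make_follower(["robots", "plan", "tasks", "quickly"]) for _ in range(32)]
program = planner(max_words=3)
print(program(followers, random.Random(0)))
```

The point of the decomposition is that the search strategy lives in the generated program, so swapping rejection sampling for a different strategy does not require changing the followers.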

A Survey on Large Language Models for Automated Planning / Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems

A Survey on Large Language Models for Automated Planning [15.8]
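This paper critically surveys existing work on the use of large language models in automated planning. Because of the limitations it identifies, LLMs are not well suited to serve as standalone planners, but they offer a significant opportunity to enhance planning applications when combined with other approaches.
Reference translation of the paper (metadata) (Tue, 18 Feb 2025 02:11:03 GMT)
A survey on automated planning with LLMs.
Planning is an essential capability for agents, yet surveys on this topic are rare, which makes this one valuable.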

Beyond Self-Talk: A Communication-Centric Survey of LLM-Based Multi-Agent Systems [11.5]
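Large language models (LLMs) have recently demonstrated remarkable capabilities in reasoning, planning, and decision-making. Researchers have begun incorporating LLMs into multi-agent systems to tackle tasks beyond the scope of single-agent settings. This survey aims to serve as a catalyst for further innovation, fostering more robust, scalable, and intelligent multi-agent systems.
Reference translation of the paper (metadata) (Thu, 20 Feb 2025 07:18:34 GMT)
A survey centered on multi-agent systems and communication.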

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving

PlanGEN: A Multi-Agent Framework for Generating Planning and Reasoning Trajectories for Complex Problem Solving [89.6]
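We propose PlanGEN, a model-agnostic and scalable agent framework with three key components: constraint, verification, and selection agents. Specifically, we propose constraint-guided iterative verification to improve the performance of inference-time algorithms.
Reference translation of the paper (metadata) (Sat, 22 Feb 2025 06:21:56 GMT)
A multi-agent framework: "PlanGEN comprises three specialized LLM agents: a constraint agent, a verification agent, and a selection agent." The paper also says "Further, we introduced a Mixture of Algorithms, an iterative framework that integrates the selection agent (Figure 1) to dynamically choose the best algorithm.", but the "A" in MoA is easy to confuse with "Agent".
Performance improves over using Gemini-1.5-Pro, Gemini-2.0-Flash, or GPT-4o individually, so an ensemble-like effect appears to be at work.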
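A rough sketch of how constraint-guided verification might plug into a Best-of-N style inference-time algorithm is shown below. This is an assumption-laden toy rather than PlanGEN's actual design: the agents are plain Python functions instead of LLM calls, the meeting-scheduling constraints are invented for illustration, and the selection agent (which chooses among algorithms) is not modeled at all.

```python
from typing import Callable, Dict, List

Plan = Dict[str, int]
Constraint = Callable[[Plan], bool]


def constraint_agent() -> List[Constraint]:
    """Hypothetical: constraints extracted from a meeting-scheduling task."""
    return [
        lambda p: p["start"] >= 9,              # not before 09:00
        lambda p: p["end"] <= 17,               # not after 17:00
        lambda p: p["end"] - p["start"] == 1,   # exactly one hour long
    ]


def verification_agent(plan: Plan, constraints: List[Constraint]) -> float:
    """Score a plan as the fraction of constraints it satisfies."""
    return sum(c(plan) for c in constraints) / len(constraints)


def best_of_n(candidates: List[Plan], constraints: List[Constraint]) -> Plan:
    """Pick the candidate plan with the highest verification score."""
    return max(candidates, key=lambda p: verification_agent(p, constraints))


candidates = [
    {"start": 8, "end": 9},    # violates the start-time constraint
    {"start": 10, "end": 11},  # satisfies all constraints
    {"start": 16, "end": 18},  # too late and too long
]
constraints = constraint_agent()
print(best_of_n(candidates, constraints))  # {'start': 10, 'end': 11}
```

Scoring candidates against explicit constraints, rather than asking for a single overall judgment, is what makes the verification step reusable across different inference-time algorithms.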

Agent Planning with World Knowledge Model

Agent Planning with World Knowledge Model [88.5]
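We introduce a parametric World Knowledge Model (WKM) to facilitate agent planning. We develop the WKM to provide prior task knowledge that guides global planning and dynamic state knowledge that assists local planning. We conduct analyses showing that our WKM can effectively alleviate the problems of blind trial-and-error and hallucinatory actions.
Reference translation of the paper (metadata) (Thu, 23 May 2024 06:03:19 GMT)
The World Knowledge Model is reported to be effective for planning. That in itself is convincing. To obtain the WKM, the authors follow this procedure: "Specifically, we first steer the agent model to synthesize task knowledge from the comparison between expert and sampled trajectories. Then we prompt it to summarize state knowledge for each planning step from expert trajectories and combine the previous and next actions to build a state knowledge base. Lastly, we integrate the generated knowledge into expert trajectories and train a WKM." This kind of design is becoming increasingly important.
The repository is said to be https://github.com/zjunlp/WKM, but at the time of writing it returns a 404.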

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks

Plan-Seq-Learn: Language Model Guided RL for Solving Long Horizon Robotics Tasks [50.3]
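Plan-Seq-Learn (PSL) is a modular approach that uses motion planning to bridge the gap between abstract language and learned low-level control. PSL achieves success rates of over 85%, outperforming language-based, classical, and end-to-end approaches.
Reference translation of the paper (metadata) (Thu, 02 May 2024 17:59:31 GMT)
A proposal for a framework for long-horizon planning, which remains difficult. It consists of high-level planning in natural language, plus a "Sequencing Module" and a "Learning Module" that realize the plan.
Repository: Plan-Seq-Learn (mihdalal.github.io)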

TPTU-v2

TPTU-v2: Boosting Task Planning and Tool Usage of Large Language Model-based Agents in Real-world Systems [25.9]
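This paper proposes a comprehensive framework aimed at improving the task planning and tool usage (TPTU) abilities of large language models (LLMs). The framework consists of three key components designed to address these challenges: (1) the API Retriever selects the APIs most relevant to the user task from the extensive array available, (2) the LLM Finetuner tunes the base LLM so that it is better suited to task planning and API calling, and (3) the Demo Selector adaptively retrieves various demonstrations related to hard-to-distinguish APIs.
Reference translation of the paper (metadata) (Sun, 19 Nov 2023 12:37:30 GMT)
This is v2 of TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents – Introduction to the Latest arXiv Papers (devneko.jp). An update after three months reflects the current pace of the field.
The framework consists of the API Retriever, LLM Finetuner, and Demo Selector. The ToolBench results look strong, but more detailed information would be welcome.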

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents

TPTU: Task Planning and Tool Usage of Large Language Model-based AI Agents [17.2]
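Large language models (LLMs) have emerged as powerful tools for a variety of real-world applications. Despite their strengths, the intrinsic generative abilities of LLMs are insufficient for handling complex tasks. This paper proposes a structured framework tailored to LLM-based AI agents.
Reference translation of the paper (metadata) (Mon, 7 Aug 2023 09:22:03 GMT)
A proposal for a framework to measure the TPTU (Task Planning and Tool Usage) abilities of LLM-based AI agents. These abilities are useful in practice and feel like a glimpse of the future. At present, commercial products (ChatGPT, Claude) are the strongest.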
