Autonomous Agent – Page 8 – Introducing the Latest arXiv Papers

Harnessing Webpage UIs for Text-Rich Visual Understanding

Harnessing Webpage UIs for Text-Rich Visual Understanding [112.0]
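We propose synthesizing general multimodal instructions from webpage UIs using text-based large language models (LLMs). These instructions are then paired with UI screenshots to train multimodal models. We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts.
Reference translation of the paper (metadata) (Thu, 17 Oct 2024 17:48:54 GMT)
The paper builds the dataset described as "We introduce MultiUI, a dataset containing 7.3 million samples from 1 million websites, covering diverse multimodal tasks and UI layouts." and uses that data to train MLLMs.
The project site is MultiUI, and the repository is GitHub – neulab/MultiUI: Code for Paper: Harnessing Webpage Uis For Text Rich Visual Understanding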

GenSim2

GenSim2: Scaling Robot Data Generation with Multi-modal and Reasoning LLMs [38.3]
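GenSim2 is a scalable framework for creating complex and realistic simulation tasks. The pipeline can generate data for up to 100 articulated tasks with 200 objects while reducing the human effort required. We show a promising use of GenSim2: the generated data can be used for zero-shot transfer or for co-training with data collected in the real world.
Reference translation of the paper (metadata) (Fri, 04 Oct 2024 17:51:33 GMT)
The paper proposes a framework consisting of "(1) task proposal, (2) solver creation, (3) multi-task training, and (4) generalization evaluation and sim-to-real transfer." The approach synthesizes data while using LLMs and MLLMs at each stage. (This is not the NLP library gensim.)
The project site is GenSim2: Scaling Robotic Data Generation with Multi-modal and Reasoning LLMs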

GenSim: A General Social Simulation Platform with Large Language Model based Agents [110.4]
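We propose a novel large language model (LLM)-based simulation platform called GenSim. Our platform supports 100,000 agents to simulate large-scale populations in real-world contexts. To the best of our knowledge, GenSim is a first step toward a general, large-scale, and correctable social simulation platform.
Reference translation of the paper (metadata) (Sun, 06 Oct 2024 05:02:23 GMT)
A platform for large-scale simulation of LLM-based agents. (This one is also not the NLP library gensim.)
The repository is GitHub – TangJiakai/GenSim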

Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement [117.9]
Gödel Agent is a self-evolving framework inspired by the Gödel machine. Gödel Agent can achieve continuous self-improvement, surpassing manually crafted agents in performance, efficiency, and generalizability.
Reference translation of the paper (metadata) (Sun, 06 Oct 2024 10:49:40 GMT)
The paper proposes an agent that can improve itself, "we introduce Gödel Agent, a self-evolving framework inspired by the Gödel machine, enabling agents to recursively improve themselves without relying on predefined routines or fixed optimization algorithms.", and reports that the approach is effective. It is a framework for agent-level improvement and does not appear to be an implementation that improves the LLM itself.
The authors note that "Currently, Gödel Agent is not sufficiently stable and may be prone to error accumulation, hindering its ability to continue self-optimization.", but seeing this line of research move forward feels like a glimpse of the future.
The repository is GitHub – Arvid-pku/Godel_Agent: Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement

DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory

DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory [96.4]
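We introduce DelTA, a document-level translation agent for large language models (LLMs). DelTA features a multi-level memory structure that stores information across various granularities and spans. Experimental results show that DelTA significantly outperforms strong baselines in translation consistency and quality.
Reference translation of the paper (metadata) (Thu, 10 Oct 2024 17:30:09 GMT)
A machine translation agent built on LLMs. Its memory consists of Proper Noun Records, a Bilingual Summary, Long-Term Memory, and Short-Term Memory.
The repository is GitHub – YutongWang1216/DocMTAgent: Code and data releases for the paper -- DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory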
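As a rough illustration of how the four memory components named above could sit together in one structure, here is a minimal Python sketch. It is not taken from the DelTA code: the class name `MultiLevelMemory`, the field names, the `record` method, the short-term window of 3 sentences, and the split of the bilingual summary into a source-side and a target-side string are all assumptions made for illustration. The summary above does not describe how DelTA actually updates or retrieves its memory.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MultiLevelMemory:
    """Hypothetical container for the four memory levels named in the summary."""

    # Proper Noun Records: source-side name -> translation chosen the first time it appeared.
    proper_nouns: dict = field(default_factory=dict)
    # Bilingual Summary: assumed here to be one running summary per language.
    source_summary: str = ""
    target_summary: str = ""
    # Long-Term Memory: every (source, target) sentence pair seen so far.
    long_term: list = field(default_factory=list)
    # Short-Term Memory: only the most recent pairs (window size 3 is an assumption).
    short_term: deque = field(default_factory=lambda: deque(maxlen=3))

    def record(self, source, target, nouns=None):
        """Store one translated sentence and any proper nouns found in it."""
        for src_name, tgt_name in (nouns or {}).items():
            # Keep the first translation so later sentences stay consistent with it.
            self.proper_nouns.setdefault(src_name, tgt_name)
        self.long_term.append((source, target))
        self.short_term.append((source, target))


memory = MultiLevelMemory()
memory.record("src 1", "tgt 1", {"Kyoto": "Kyoto-A"})
memory.record("src 2", "tgt 2", {"Kyoto": "Kyoto-B"})
print(memory.proper_nouns["Kyoto"])
```

In this sketch, keeping the first recorded translation of a proper noun is one simple way to get the terminology consistency the paper targets. A real system would also need a retrieval step that decides which stored pairs to show the LLM.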

Agent S: An Open Agentic Framework that Uses Computers Like a Human

Agent S: An Open Agentic Framework that Uses Computers Like a Human [31.2]
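We propose Agent S, an open agentic framework that enables autonomous interaction with computers through a graphical user interface (GUI). Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces.
Reference translation of the paper (metadata) (Thu, 10 Oct 2024 17:43:51 GMT)
The paper proposes an agent framework that operates a computer the way a person would.
The repository is GitHub – simular-ai/Agent-S: Official codebase for Agent S, a open agentic framework that uses computers like a human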

Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models

Can-Do! A Dataset and Neuro-Symbolic Grounded Framework for Embodied Planning with Large Multimodal Models [85.6]
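We introduce Can-Do, a benchmark dataset designed to evaluate embodied planning abilities. Our dataset contains 400 multimodal samples, each consisting of natural-language user instructions, visual images depicting the environment, state changes, and the corresponding action plans. We propose NeuroGround, which first grounds plan generation in the perceived environment states and then leverages symbolic planning engines to augment the model-generated plans.
Reference translation of the paper (metadata) (Sun, 22 Sep 2024 00:30:11 GMT)
The paper proposes a multimodal dataset that measures embodied planning ability across diverse scenarios, along with NeuroGround, which uses a symbolic engine to solve these tasks.
The repository is Can-Do! A Dataset for Embodied Planning with Large Multimodal Models (embodied-planning.github.io)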

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions

HAICOSYSTEM: An Ecosystem for Sandboxing Safety Risks in Human-AI Interactions [76.4]
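This paper proposes HAICOSYSTEM, a framework for examining the safety of AI agents within diverse and complex social interactions. We run 1,840 simulations based on 92 scenarios across seven domains (e.g., healthcare, finance, education). Our experiments show that state-of-the-art LLMs, both proprietary and open-source, exhibit safety risks in over 50% of cases.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 19:47:21 GMT)
The paper proposes a framework for checking the safety of AI agents.
The project site is AN ECOSYSTEM FOR SANDBOXING SAFETY RISKS IN HUMAN-AI INTERACTIONS (haicosystem.org)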

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning

Improving Autonomous AI Agents with Reflective Tree Search and Self-Learning [78.4]
Reflective Monte Carlo Tree Search (R-MCTS) is a novel test-time algorithm designed to enhance the capabilities of AI agents. R-MCTS extends traditional MCTS by incorporating contrastive reflection, which allows agents to learn from past interactions. The paper also improves agent performance by fine-tuning GPT-4o through self-learning.
Reference translation of the paper (metadata) (Wed, 02 Oct 2024 21:42:35 GMT)
The paper proposes a variant of Monte Carlo tree search, "We propose Reflective Monte Carlo Tree Search (R-MCTS), an extension of classic MCTS that improves the agent’s decision making process on the fly by incorporating reflection over its past task executions, and state estimations using multi-agent-debate", and improves benchmark results through SFT based on it. It outperforms ToT and plain MCTS.
The repository is jasonyux/RMCTS-self-learning · GitHub
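For readers unfamiliar with the classic MCTS that R-MCTS builds on, the selection step is commonly implemented with the UCT rule, which balances a child's average value against how rarely it has been visited. The sketch below shows only that standard rule. It does not implement the reflection or multi-agent-debate components of R-MCTS, and the summary above does not say which selection rule the authors use. The function names `uct_score` and `select_child`, and the representation of children as a dict of `(value_sum, visits)` tuples, are assumptions for illustration.

```python
import math


def uct_score(value_sum, visits, parent_visits, c=math.sqrt(2)):
    """Standard UCT score: average value plus an exploration bonus."""
    if visits == 0:
        # Unvisited children are tried first.
        return math.inf
    exploit = value_sum / visits
    explore = c * math.sqrt(math.log(parent_visits) / visits)
    return exploit + explore


def select_child(children, parent_visits):
    """Pick the child name with the highest UCT score.

    `children` maps a child name to a (value_sum, visits) tuple.
    """
    return max(children, key=lambda name: uct_score(*children[name], parent_visits))


# Two children visited equally often: the one with the higher average value wins.
print(select_child({"click_search": (9.0, 10), "scroll_down": (1.0, 10)}, parent_visits=20))
```

The exploration constant `c` controls how strongly rarely visited actions are favored. The value `sqrt(2)` is a common default, not a value taken from the paper.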

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale

Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale [97.2]
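LLMs can function as autonomous agents that interact with digital environments and complete specific objectives. Their accuracy is still not sufficient, partly because large-scale, direct demonstrations for digital tasks are lacking. We propose Synatra, an approach that turns indirect knowledge into direct supervision at scale.
Reference translation of the paper (metadata) (Tue, 24 Sep 2024 00:51:45 GMT)
The paper proposes an approach that synthesizes the actions an agent should take on complex tasks. The idea seems to be that using an LLM to fill in vague passages, such as a manual that only says "enter a keyword", contributes to better performance. This highlights the limits of agents (how they differ from humans) as well as the effectiveness of synthetic data and the power of LLMs.
The authors confirm the effectiveness: "We use 100k such synthetically-created demonstrations to finetune a 7B CodeLlama, and demonstrate that the resulting agent surpasses all comparably sized models on three web-based task benchmarks Mind2Web, MiniWoB++ and WebArena, as well as surpassing GPT-3.5 on WebArena and Mind2Web." The approach is also cost-effective: "In addition, while synthetic demonstrations prove to be only 3% the cost of human demonstrations (at $0.031 each), we show that the synthetic demonstrations can be more effective than an identical number of human demonstrations collected from limited domains."
The repository is Synatra: Turning Indirect Knowledge into Direct Demonstrations for Digital Agents at Scale (oootttyyy.github.io)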
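The cost figures quoted above can be turned into a quick back-of-the-envelope estimate. The sketch below uses only the numbers in the quotation ($0.031 per synthetic demonstration, 100k demonstrations, 3% of the human cost). The per-demonstration human cost it derives is an inference from the rounded 3% figure, not a number stated in the source, and the constant and function names are made up for this example.

```python
# Figures taken from the quoted abstract.
SYNTHETIC_COST_USD = 0.031   # cost of one synthetic demonstration
NUM_DEMONSTRATIONS = 100_000  # demonstrations used for fine-tuning
COST_RATIO = 0.03             # synthetic cost as a fraction of human cost


def total_synthetic_cost(n=NUM_DEMONSTRATIONS, unit_cost=SYNTHETIC_COST_USD):
    """Total cost of generating n synthetic demonstrations."""
    return n * unit_cost


def implied_human_unit_cost(unit_cost=SYNTHETIC_COST_USD, ratio=COST_RATIO):
    """Human cost per demonstration implied by the 3% ratio (an inference)."""
    return unit_cost / ratio


print(f"synthetic total: ${total_synthetic_cost():,.0f}")
print(f"implied human cost per demonstration: ${implied_human_unit_cost():.2f}")
print(f"implied human total: ${NUM_DEMONSTRATIONS * implied_human_unit_cost():,.0f}")
```

Under these assumptions the full synthetic set costs about $3,100, while the same number of human demonstrations would cost roughly $103,000, at about $1.03 each.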

Agents in Software Engineering: Survey, Landscape, and Vision

Agents in Software Engineering: Survey, Landscape, and Vision [46.0]
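Large language models (LLMs) have achieved remarkable success and have been widely used in a variety of downstream tasks. Many studies that combine LLMs with software engineering (SE) have adopted the concept of agents, either explicitly or implicitly. This paper presents a framework for LLM-based agents in SE that comprises three key modules: perception, memory, and action.
Reference translation of the paper (metadata) (Fri, 13 Sep 2024 17:55:58 GMT)
A survey of agent use in software engineering, by a different team from the one behind Large Language Model-Based Agents for Software Engineering: A Survey (covered earlier on this site, devneko.jp). This survey focuses on the agent-side technology.
The repository is GitHub – DeepSoftwareAnalytics/Awesome-Agent4SE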
