Autonomous Agent – Page 10 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

ResearchTown: Simulator of Human Research Community

ResearchTown: Simulator of Human Research Community [14.0]
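ResearchTown is a multi-agent framework for simulating research communities. It provides a realistic simulation of collaborative research activities and maintains a robust simulation with multiple researchers and diverse papers.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 18:26:53 GMT)
Multi-agent frameworks are in vogue, and now we have finally arrived at a Town...
I am very curious to see what happens when the graph structure is changed.
The repository is GitHub – ulab-uiuc/research-town: A platform for developers to simulate research community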

PC Agent: While You Sleep, AI Works — A Cognitive Journey into Digital World

PC Agent: While You Sleep, AI Works — A Cognitive Journey into Digital World [19.0]
PC Agent is an AI system that demonstrates an important step toward this vision through human cognition transfer. To validate this hypothesis, we introduce three key innovations. Preliminary experiments on PowerPoint presentation creation show that complex digital work capabilities can be achieved with a small amount of high-quality cognitive data.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 14:02:12 GMT)
The paper proposes a method that, "trained on just 133 cognitive trajectories, can handle sophisticated work scenarios involving up to 50 steps across multiple applications." It comes across as very efficient. The conclusion reads "In conclusion, we presented a cognition transfer framework that efficiently guides AI to the digital world through three key components: PC Tracker for collecting human-computer interaction data, a two-stage post-processing for cognition completion, and a multi-agent system for computer task automation.", but in real-world deployment the PC Tracker part seems likely to cause various problems. Who should this kind of operation data belong to?
The repository is PC Agent: While You Sleep, AI Works – A Cognitive Journey into Digital World

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought

DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought [89.5]
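DRT-o1 is an attempt to bring the success of long chain-of-thought (CoT) to neural machine translation (MT). We first mine sentences containing similes or metaphors from existing literature, then develop a multi-agent framework that translates these sentences via long thought. Experiments on literature translation demonstrate the effectiveness of DRT-o1.
Reference translation of the paper (metadata) (Mon, 23 Dec 2024 11:55:33 GMT)
An application of chain of thought to machine translation. The approach is to collect data, synthesize data with a multi-agent framework, and then fine-tune. 124 GPU hours for the 14B model is less than I expected, yet performance improves substantially.
The project site is GitHub – krystalan/DRT-o1: DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought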

GUI Agents: A Survey

GUI Agents: A Survey [129.9]
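Graphical User Interface (GUI) agents have emerged as a transformative approach to automating human-computer interaction. Motivated by the growing interest in and fundamental importance of GUI agents, we provide a comprehensive survey that categorizes their benchmarks, evaluation metrics, architectures, and training methods.
Reference translation of the paper (metadata) (Wed, 18 Dec 2024 04:48:28 GMT)
A survey of agents that operate GUIs.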

Think&Cite, RAG-Star

Think&Cite: Improving Attributed Text Generation with Self-Guided Tree Search and Progress Reward Modeling [64.0]
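Large language models (LLMs) tend to hallucinate and produce factually incorrect information. We propose a novel framework called Think&Cite and formulate attributed text generation as a multi-step reasoning problem integrated with search.
Reference translation of the paper (metadata) (Thu, 19 Dec 2024 13:55:48 GMT)
The paper proposes Self-Guided Monte Carlo Tree Search (SG-MCTS) for generating text with supporting evidence. Many efforts use Monte Carlo tree search to improve performance, but the claim "To the best of our knowledge, we are the first to apply tree search algorithms to the task of attributed text generation." may well be true.
The authors report performance exceeding RAG and other baselines. It looks like an effective method.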

RAG-Star: Enhancing Deliberative Reasoning with Retrieval Augmented Verification and Refinement [85.1]
Existing large language models (LLMs) show exceptional problem-solving capabilities but can struggle with complex reasoning tasks. We propose RAG-Star, a novel RAG approach that integrates retrieved information. Experiments with Llama-3.1-8B-Instruct and GPT-4o show that RAG-Star significantly outperforms previous RAG and reasoning methods.
Reference translation of the paper (metadata) (Tue, 17 Dec 2024 13:05:36 GMT)
"RAG-Star employed Monte Carlo Tree Search to search intermediate sub-queries and corresponding answers. Moreover, RAG-Star introduced retrieval-augmented verification to evaluate the plausibility and consistency of the planned subqueries and answers based on a query-aware and an answer-aware reward." This one is a report of the type that combines RAG with Monte Carlo Tree Search.
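Both entries above build on Monte Carlo Tree Search. As a rough illustration of the selection and backpropagation steps that such methods start from, here is a generic textbook UCT sketch in Python. It is not the SG-MCTS or RAG-Star algorithm from the papers; the `Node` class, the function names, and the exploration constant `c = 1.4` are all assumptions made for this example, and the reward would in practice come from something like a progress or verification model.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """One search-tree node, e.g. a partial answer or sub-query (hypothetical)."""
    label: str
    visits: int = 0
    total_reward: float = 0.0
    children: List["Node"] = field(default_factory=list)


def uct_score(parent: Node, child: Node, c: float = 1.4) -> float:
    """Textbook UCT: mean reward plus an exploration bonus."""
    if child.visits == 0:
        return math.inf  # unvisited children are always tried first
    mean = child.total_reward / child.visits
    bonus = c * math.sqrt(math.log(parent.visits) / child.visits)
    return mean + bonus


def select_child(parent: Node) -> Node:
    """Pick the child with the highest UCT score (first one wins on ties)."""
    return max(parent.children, key=lambda ch: uct_score(parent, ch))


def backpropagate(path: List[Node], reward: float) -> None:
    """Add one simulation's reward to every node on the root-to-leaf path."""
    for node in path:
        node.visits += 1
        node.total_reward += reward
```

A search loop would repeatedly call `select_child` from the root down to a leaf, expand and score that leaf, and then call `backpropagate` on the visited path.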

A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios

A Survey on Large Language Model-Based Social Agents in Game-Theoretic Scenarios [44.0]
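Game-theoretic scenarios have become important for evaluating the social intelligence of Large Language Model (LLM)-based social agents. This survey organizes the research into three core components: game frameworks, social agents, and evaluation protocols.
Reference translation of the paper (metadata) (Thu, 05 Dec 2024 06:46:46 GMT)
A survey of LLM-based agents in game-theoretic contexts.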

From Intention To Implementation: Automating Biomedical Research via LLMs

From Intention To Implementation: Automating Biomedical Research via LLMs [32.0]
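This paper introduces BioResearcher, the first end-to-end automated system designed to streamline the entire biomedical research process. By decomposing complex tasks into logically related sub-tasks, BioResearcher effectively addresses the challenges of multidisciplinary requirements and logical complexity. BioResearcher achieves an average execution success rate of 63.07% across eight previously unmet research objectives.
Reference translation of the paper (metadata) (Thu, 12 Dec 2024 16:35:05 GMT)
According to the paper, "BioResearcher employs a modular multi-agent architecture, integrating specialized agents for search, literature processing, experimental design, and programming."
The number is hard to interpret, but the success rate seems quite high...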

The BrowserGym Ecosystem for Web Agent Research

The BrowserGym Ecosystem for Web Agent Research [151.9]
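The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents. We conduct the first large-scale, multi-benchmark web agent experiment. The results highlight large discrepancies between the latest models from OpenAI and Anthropic.
Reference translation of the paper (metadata) (Fri, 06 Dec 2024 23:43:59 GMT)
A benchmark environment for web agent development. The authors also release a unification of existing benchmarks and AgentLab. According to the current leaderboard (BrowserGym Leaderboard – a Hugging Face Space by ServiceNow), the strong performance of Claude 3.5 Sonnet stands out.
The repositories are GitHub – ServiceNow/BrowserGym: 🌎💪 BrowserGym, a Gym environment for web task automation and GitHub – ServiceNow/AgentLab: AgentLab: An open-source framework for developing, testing, and benchmarking web agents on diverse tasks, designed for scalability and reproducibility.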

Large Language Model-Brained GUI Agents: A Survey

Large Language Model-Brained GUI Agents: A Survey [43.2]
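Multimodal models have ushered in a new era of GUI automation. They demonstrate exceptional capabilities in natural language understanding, code generation, and visual processing. These agents represent a paradigm shift, letting users perform complex multi-step tasks through simple conversational commands.
Reference translation of the paper (metadata) (Wed, 27 Nov 2024 12:13:39 GMT)
A survey similar to GUI Agents with Foundation Models: A Comprehensive Survey – arXiv最新論文の紹介, but here the first author is a Microsoft researcher.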

Model Context Protocol (MCP), QwQ, OLMo 2

Last week again brought a variety of news, but the highlight is Anthropic's Model Context Protocol. Introducing the Model Context Protocol \ Anthropic, Introduction – Model Context Protocol

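Roughly speaking, it is a protocol for integrating LLMs with external data and tools. When building LLM systems that assume external tool use or extended memory, whether a standard of this kind exists matters a great deal. I am very curious whether MCP can become the de facto standard.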
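To make "a protocol for integrating LLMs with tools" a little more concrete: MCP messages are JSON-RPC 2.0, and a client invokes a server-side tool with the `tools/call` method. The following is a minimal sketch in Python that only builds such request messages. The tool name `search_papers`, its arguments, and the helper names `make_request` and `call_tool` are hypothetical, no transport or initialization handshake is shown, and a real client would more likely use an official SDK than hand-built JSON.

```python
import json
from itertools import count
from typing import Any, Dict, Optional

# Monotonically increasing request ids, as JSON-RPC requires a unique id per request.
_request_ids = count(1)


def make_request(method: str, params: Optional[Dict[str, Any]] = None) -> str:
    """Serialize one JSON-RPC 2.0 request (helper name is an assumption)."""
    message: Dict[str, Any] = {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": method,
    }
    if params is not None:
        message["params"] = params
    return json.dumps(message)


def call_tool(name: str, arguments: Dict[str, Any]) -> str:
    """Build a `tools/call` request for a tool exposed by an MCP server."""
    return make_request("tools/call", {"name": name, "arguments": arguments})


if __name__ == "__main__":
    # `search_papers` is a made-up tool used purely for illustration.
    print(call_tool("search_papers", {"query": "GUI agents survey"}))
```

The value of a shared standard is visible even in this toy form: any server that understands the same message shape can expose its tools to any client without custom glue code.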

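Among open models, the very high-performing Qwen with Questions (QwQ) and OLMo 2, the second version of the previously covered OLMo (Dolma, OLMo – arXiv最新論文の紹介), deserve close attention. As with O1 Replication Journey and TULU3, efforts that openly share which methods and approaches improve performance are highly valuable.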

QwQ: Reflect Deeply on the Boundaries of the Unknown | Qwen
- An open model described as "QwQ-32B-Preview is an experimental research model developed by the Qwen Team, focused on advancing AI reasoning capabilities." Its performance is strong even compared with OpenAI o1. Many efforts inspired by o1 are under way, and the competition is truly fierce.
- The repository is Qwen/QwQ-32B-Preview · Hugging Face
- The demo is QwQ-32B-Preview – a Hugging Face Space by Qwen
OLMo 2: The best fully open language model to date | Ai2
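- A model whose construction method, data, and weights are all public, with performance close to the state of the art.
- The repository is OLMo 2 – a allenai Collection
- The demo is Ai2 Playground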

O1 Replication Journey — Part 2: Surpassing O1-preview through Simple Distillation, Big Progress or Bitter Lesson? [30.9]
This paper presents a critical examination of current approaches to replicating the capabilities of OpenAI's O1 model. We show that simple distillation from O1's API, combined with supervised fine-tuning, achieves superior performance on complex mathematical reasoning tasks.
Reference translation of the paper (metadata) (Mon, 25 Nov 2024 15:31:27 GMT)
Research on OpenAI o1, and Part 2 following Fugu-MT 論文翻訳(概要): O1 Replication Journey: A Strategic Progress Report — Part 1. The statement "While our previous work (Part 1 (Qin et al., 2024)) explored the fundamental technical path to O1 replication, this study reveals how simple distillation from O1’s API, combined with supervised fine-tuning, can achieve superior performance on complex mathematical reasoning tasks." is fair enough, but "Notably, despite training only on mathematical problem-solving data, our models demonstrated strong generalization to open-ended QA tasks and became significantly less susceptible to sycophancy after fine-tuning." is surprising.
The repository is GitHub – GAIR-NLP/O1-Journey: O1 Replication Journey: A Strategic Progress Report – Part I

TÜLU 3: Pushing Frontiers in Open Language Model Post-Training [94.1]
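We introduce TÜLU 3, a family of fully open, state-of-the-art post-trained models. TÜLU 3 builds on Llama 3.1 base models and surpasses the instruct versions of Llama 3.1, Qwen 2.5, and Mistral, as well as closed models such as GPT-4o-mini and Claude 3.5-Haiku.
Reference translation of the paper (metadata) (Fri, 22 Nov 2024 18:44:04 GMT)
The repository is GitHub – allenai/open-instruct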
