August 2025 – Page 3 – Introducing the Latest arXiv Papers

Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges

Web3 x AI Agents: Landscape, Integrations, and Foundational Challenges [29.3]
The convergence of Web3 technologies and AI agents represents a rapidly evolving frontier that is reshaping decentralized ecosystems. This paper presents the first and most comprehensive analysis of the intersection of Web3 and AI agents across five key dimensions: landscape, economics, governance, security, and trust mechanisms.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 15:44:58 GMT)
A survey, described by the authors as follows: "This paper presents the first comprehensive systematic analysis of Web3-AI agent integration, examining 133 active projects with $6.9 billion collective market capitalization to reveal how AI agents fundamentally reshape decentralized ecosystems across the landscape, finance, governance, security, and trust dimensions."

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation

Shortcut Learning in Generalist Robot Policies: The Role of Dataset Diversity and Fragmentation [117.5]
Generalist robot policies trained on large-scale datasets such as Open X-Embodiment (OXE) show strong performance across a wide range of tasks. However, they often struggle to generalize beyond the distribution of their training data. We identify shortcut learning as a key impediment to generalization.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 16:14:01 GMT)
The authors state: "Our analysis reveals that large-scale robot datasets like OXE suffer from limited sub-dataset diversity and severe fragmentation, a problem that extends even within individual sub-datasets. This structure inherently promotes shortcut learning, meaning that simply adding more similarly-fragmented data can be detrimental to generalization." Building general-purpose models is hard.
The project site is Shortcut Learning in GRPs.

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows

OdysseyBench: Evaluating LLM Agents on Long-Horizon Complex Office Application Workflows [10.3]
Large language models (LLMs) are increasingly deployed in real-world applications that require complex, long-horizon reasoning. OdysseyBench is a comprehensive benchmark for evaluating LLM agents on long-horizon workflows across various office applications. To enable scalable benchmark creation, we propose HomerAgents, a multi-agent framework that automatically generates long-horizon workflow benchmarks.
Reference translation of the paper (metadata) (Tue, 12 Aug 2025 17:53:03 GMT)
A benchmark proposal that also includes a benchmark-creation framework: "We introduce OdysseyBench, a comprehensive benchmark for evaluating agents on long-horizon workflows across multiple office applications, consisting of OdysseyBench+ and OdysseyBench-Neo." and "We propose HOMERAGENTS, a multi-agent framework that automates the generation of long-horizon tasks, enabling scalable and diverse benchmark creation."
The repository is reportedly https://github.com/microsoft/OdysseyBench, but at the time of writing it returns a 404.

Provable In-Context Vector Arithmetic via Retrieving Task Concepts

Provable In-Context Vector Arithmetic via Retrieving Task Concepts [53.7]
We show how nonlinear residual transformers trained by gradient descent on the cross-entropy loss perform factual-recall ICL tasks via vector arithmetic. These results elucidate the advantages of transformers over static embedding predecessors.
Reference translation of the paper (metadata) (Wed, 13 Aug 2025 13:54:44 GMT)
A theoretical study that assumes task vectors: "We develop an optimization theory demonstrating that transformers with nonlinear softmax attention, MLP, layer normalization, and residual connections—trained via Gradient Descent (GD) with cross-entropy loss—can effectively perform factual-recall ICL in a vector arithmetic manner, grounded in empirically motivated data modeling. Our analysis shows that the transformer retrieves the high-level task/function concept through attention-MLP, which, when combined with any embedded query vector within the same high-level task concept, yields the correct corresponding answer vector."
Many open questions seem to remain, but I look forward to further progress in theoretical research.
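To make the "vector arithmetic" idea in the quote concrete, here is a minimal toy sketch in Python. It is not the paper's model or code: the 3-dimensional embeddings, the `EMB` table, and the helper names (`task_vector`, `answer`, `nearest`) are all hypothetical, and the task vector is simply averaged from demonstration pairs rather than retrieved by a trained attention-MLP block as the paper analyzes. It only illustrates the claim that a task concept vector added to a query embedding can land on the answer embedding.

```python
# Toy illustration only: embeddings and helper names are made up for this sketch,
# not taken from the paper.

def add(u, v):
    """Element-wise sum of two vectors."""
    return [a + b for a, b in zip(u, v)]

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def nearest(vec, table):
    """Return the key whose embedding is closest to vec (squared Euclidean distance)."""
    return min(table, key=lambda k: sum((a - b) ** 2 for a, b in zip(vec, table[k])))

# Hypothetical embeddings for a "country -> capital" factual-recall task.
EMB = {
    "France": [1.0, 0.0, 0.0], "Paris": [1.0, 0.0, 1.0],
    "Japan":  [0.0, 1.0, 0.0], "Tokyo": [0.0, 1.0, 1.0],
    "Italy":  [1.0, 1.0, 0.0], "Rome":  [1.0, 1.0, 1.0],
}

def task_vector(demos):
    """Estimate the task concept as the mean (answer - query) offset over demonstrations."""
    return mean([sub(EMB[y], EMB[x]) for x, y in demos])

def answer(query, demos):
    """Add the task vector to the query embedding and decode the nearest other token."""
    predicted = add(EMB[query], task_vector(demos))
    candidates = {k: v for k, v in EMB.items() if k != query}
    return nearest(predicted, candidates)

demos = [("France", "Paris"), ("Japan", "Tokyo")]
print(answer("Italy", demos))  # prints "Rome"
```

In this toy setup the same offset works for every query in the task by construction; the paper's contribution is a theory of when trained transformers arrive at such a structure.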

Don’t Overthink It: A Survey of Efficient R1-style Large Reasoning Models

Don’t Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.6]
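Large Reasoning Models (LRMs) have gradually become a research hotspot thanks to their outstanding performance on complex tasks. However, as these models have been widely applied, the problem of overthinking has gradually emerged. Various efficient reasoning methods have been proposed that aim to shorten reasoning paths without compromising model performance or reasoning capability.
Reference translation of the paper (metadata) (Mon, 04 Aug 2025 06:54:31 GMT)
A survey on making reasoning more efficient. It is surprising how many different approaches and research results already exist.
The repository is yuelinan/Awesome-Efficient-R1-style-LRMs.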

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [11.7]
This paper introduces M3-Agent, a novel framework equipped with long-term memory. M3-Agent can process real-time visual and auditory inputs to build and update its long-term memory. We developed M3-Bench, a long-video question answering benchmark.
Reference translation of the paper (metadata) (Wed, 13 Aug 2025 12:03:03 GMT)
This is another proposal for an agent framework with long-term memory. The authors report its effectiveness as follows: "Compared to the strongest baseline, Gemini-GPT4o-Hybrid, which implements M3-Agent framework by prompting Gemini-1.5-Pro [41] for memorization and GPT-4o [15] for control, M3-Agent improves accuracy by 6.7%, 7.7%, and 5.3% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our ablation study demonstrates the importance of semantic memory: removing it reduces accuracy by 17.1%, 19.2% and 13.1% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively."
The project site is Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory.

Memp: Exploring Agent Procedural Memory

Memp: Exploring Agent Procedural Memory [72.4]
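Agents based on large language models (LLMs) handle a wide variety of tasks, yet they suffer from brittle procedural memory that is manually engineered or entangled in static parameters. This paper proposes Memp, which distills past agent trajectories into both fine-grained, step-by-step instructions and higher-level, script-like abstractions. We show that as the memory repository is refined, agents achieve steadily higher success rates and greater efficiency on analogous tasks.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 16:20:56 GMT)
This work introduces memory into agents. The authors state: "Empirical results on housework automation and information-seeking benchmarks show that leveraging procedural memory significantly boosts task success rates and efficiency. Beyond improving individual episodes, Memp supports continual learning and robust generalization, marking a step toward self-improving, resilient agents."
Memory management appears to be handled in a simple way.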

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.6]
The dream of creating AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality. This survey organizes the current state of OS Agents research and provides guidance for both academic inquiry and industrial development.
Reference translation of the paper (metadata) (Wed, 06 Aug 2025 14:33:45 GMT)
A survey that opens with: "The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e.g., computers and mobile phones) by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced."
The repository is OS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use (ACL 2025).

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock / AgroBench: Vision-Language Model Benchmark in Agriculture

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock [78.0]
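Crops, fisheries, and livestock form the backbone of global food production and are essential to feeding the ever-growing global population. Addressing these challenges requires efficient, accurate, and scalable technological solutions, highlighting the importance of artificial intelligence (AI). This survey systematically and thoroughly reviews more than 200 research works, covering conventional machine learning approaches, advanced deep learning techniques, and recent vision-language foundation models.
Reference translation of the paper (metadata) (Tue, 29 Jul 2025 17:59:48 GMT)
A survey of AI applications in agriculture.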

AgroBench: Vision-Language Model Benchmark in Agriculture [25.5]
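AgroBench is a benchmark for evaluating vision-language models (VLMs) across seven agricultural topics. AgroBench covers a state-of-the-art range of categories, including 203 crop categories and 682 disease categories, to thoroughly evaluate VLM capabilities.
Reference translation of the paper (metadata) (Mon, 28 Jul 2025 04:58:29 GMT)
This one is a benchmark for the agricultural domain.
The repository is AgroBench.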

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.6]
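GLM-4.5 is an open-source Mixture-of-Experts (MoE) large language model with 355B total parameters and 32B activated parameters. Through multi-stage training on 23T tokens and comprehensive post-training with expert model iteration and reinforcement learning, GLM-4.5 achieves strong performance across agentic, reasoning, and coding tasks. We release both GLM-4.5 (355B parameters) and GLM-4.5-Air (106B parameters) to advance research in reasoning and agentic AI systems.
Reference translation of the paper (metadata) (Fri, 08 Aug 2025 17:21:06 GMT)
The paper for GLM-4.5 (covered earlier in GLM-4.5, Step-3, Falcon-H1, HunyuanWorld – Introducing the Latest arXiv Papers). For its level of performance it has few parameters, especially active parameters. It is hard to say anything definite without a detailed comparison, but a comparison with GPT-OSS would be interesting.
The repository is GitHub – zai-org/GLM-4.5: GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai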
