2025年8月18日 – arXiv最新論文の紹介

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Seeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory [11.7]
本稿では,長期記憶を備えた新しいフレームワークであるM3-Agentを紹介する。 M3-Agentは、リアルタイムの視覚および聴覚入力を処理して、長期記憶の構築と更新を行うことができる。我々は,M3-Benchという長ビデオ質問応答ベンチマークを開発した。
論文参考訳（メタデータ） (Wed, 13 Aug 2025 12:03:03 GMT)
こちらも長期記憶を備えたエージェントフレームワークの提案。「Compared to the strongest baseline, Gemini-GPT4o-Hybrid, which implements M3-Agent framework by prompting Gemini-1.5-Pro [41] for memorization and GPT-4o [15] for control, M3-Agent improves accuracy by 6.7%, 7.7%, and 5.3% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively. Our ablation study demonstrates the importance of semantic memory: removing it reduces accuracy by 17.1%, 19.2% and 13.1% on M3-Bench-robot, M3-Bench-web, and VideoMME-long, respectively.」と効果を報告している。
プロジェクトサイトはSeeing, Listening, Remembering, and Reasoning: A Multimodal Agent with Long-Term Memory

Memp: Exploring Agent Procedural Memory [72.4]
LLM(Large Language Models)ベースのエージェントは様々なタスクをこなすが、静的パラメータで手動で設計または絡み合うような不安定なプロシージャメモリに悩まされる。本稿では,過去のエージェントの軌跡をステップバイステップの細粒度と高レベルなスクリプトライクな抽象化の両方に蒸留するMempを提案する。メモリレポジトリが洗練されるにつれて、エージェントは着実に高い成功率と類似タスクの効率を達成できることを示す。
論文参考訳（メタデータ） (Fri, 08 Aug 2025 16:20:56 GMT)
エージェントへのMemory導入、「Empirical results on housework automation and information-seeking bench- marks show that leveraging procedural memory significantly boosts task success rates and efficiency. Beyond improving individual episodes, Memp supports continual learning and robust generalization, marking a step toward self-improving, resilient agents.」とのこと。
メモリ管理はシンプルに行っているように見える。

OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use [101.6]
アイアンマンの架空のJ.A.R.V.I.Sほど有能で多用途なAIアシスタントを作る夢は、長い間想像力に恵まれてきた。マルチモーダル(multi-modal)な大きな言語モデル((M)LLMs)の進化により、この夢は現実に近づいている。本調査は,OSエージェント研究の現状を整理し,学術調査と産業開発の両方の指針を提供する。
論文参考訳（メタデータ） (Wed, 06 Aug 2025 14:33:45 GMT)
「The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multimodal) large language models ((M)LLMs), this dream is closer to reality, as (M)LLM-based Agents using computing devices (e g , computers and mobile phones) by operating within the environments and interfaces (e g , Graphical User Interface (GUI)) provided by operating systems (OS) to automate tasks have significantly advanced.」から始まるサーベイ。
リポジトリはOS Agents: A Survey on MLLM-based Agents for Computer, Phone and Browser Use (ACL 2025)

AI in Agriculture: A Survey of Deep Learning Techniques for Crops, Fisheries and Livestock [78.0]
作物、漁業、家畜が世界の食料生産のバックボーンを形成し、成長を続ける世界の人口を養うのに不可欠である。これらの問題に対処するには、効率的で正確でスケーラブルな技術ソリューションが必要であり、人工知能(AI)の重要性を強調している。本調査では,従来の機械学習アプローチ,高度なディープラーニング技術,最新のビジョン言語基礎モデルなど,200以上の研究成果を体系的かつ徹底的にレビューする。
論文参考訳（メタデータ） (Tue, 29 Jul 2025 17:59:48 GMT)
農業分野におけるAI活用のサーベイ

AgroBench: Vision-Language Model Benchmark in Agriculture [25.5]
AgroBenchは、視覚言語モデル(VLM)を7つの農業トピックにわたって評価するためのベンチマークである。私たちのAgroBenchは、203の作物カテゴリと682の病気カテゴリを含む最先端のカテゴリをカバーし、VLM能力を徹底的に評価しています。
論文参考訳（メタデータ） (Mon, 28 Jul 2025 04:58:29 GMT)
こちらは農業分野のベンチマーク
リポジトリはAgroBehch

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models [194.6]
GLM-4.5はオープンソースのMixture-of-Experts(MoE)大言語モデルであり,総パラメータは355B,アクティベートパラメータは32Bである。 23Tトークンのマルチステージトレーニングと、エキスパートモデルのイテレーションと強化学習による総合的なポストトレーニングを通じて、GLM-4.5はエージェント、推論、コーディングタスクにわたって強力なパフォーマンスを実現している。 GLM-4.5(355Bパラメータ)とGLM-4.5-Air(106Bパラメータ)をそれぞれリリースし、推論とエージェントAIシステムの研究を進めた。
論文参考訳（メタデータ） (Fri, 08 Aug 2025 17:21:06 GMT)
GLM-4.5（GLM-4.5, Step-3, Falcon-H1, HunyuanWorld – arXiv最新論文の紹介）の論文。性能の割にパラメータ（特にアクティブパラメータ）が少ない。詳細に比較しないと何とも言えないところではあるが、GPT-OSSとの比較が気になるところ。
リポジトリはGitHub – zai-org/GLM-4.5: GLM-4.5: An open-source large language model designed for intelligent agents by Z.ai