July 2025 – Page 5 – Introducing the Latest arXiv Papers

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents [45.5]
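Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 13:34:34 GMT)
A survey of AI agents and their security risks.
There are many points to consider.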

Scaling RL to Long Videos

Scaling RL to Long Videos [107.4]
LongVILA-R1-7B achieves strong performance on long-video QA benchmarks such as VideoMME. LongVILA-R1 marks a first step toward long-video reasoning in vision-language models. We release a publicly available training system that supports RL training across various modalities.
Reference translation of the paper (metadata) (Thu, 10 Jul 2025 17:47:40 GMT)
Proposes a framework for understanding long videos, built on "(1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling."
As the paper puts it, "Unlike domains such as math or code reasoning, where structured supervision and benchmarks are readily available [7, 8], long video reasoning requires annotating complex temporal dynamics, goals, spatial relations, and narrative elements—often across minutes or hours of footage", so the difficulty here differs from that of code generation or mathematical reasoning.
The repository is GitHub – NVlabs/Long-RL: Long-RL: Scaling RL to Long Sequences

AI4Research: A Survey of Artificial Intelligence for Scientific Research

AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5]
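We conduct a comprehensive survey of AI for Research (AI4Research). We first introduce a systematic taxonomy that classifies five main tasks in AI4Research. We then identify key research gaps and highlight promising future directions.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 17:19:20 GMT)
A survey on applying AI to research. It treats the following as the main tasks.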
- (1) AI for Scientific Comprehension
- (2) AI for Academic Surveys
- (3) AI for Scientific Discovery
- (4) AI for Academic Writing
- (5) AI for Academic Reviewing
The project site is AI4Research: A Survey of Artificial Intelligence for Scientific Research

CritiQ: Mining Data Quality Criteria from Human Preferences

CritiQ: Mining Data Quality Criteria from Human Preferences [70.4]
We introduce CritiQ, a novel data selection method that automatically mines data quality criteria from human preferences. CritiQ Flow uses a manager agent to evolve quality criteria, while worker agents make pairwise judgments. We demonstrate the effectiveness of the method in the code, math, and logic domains.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 09:58:59 GMT)
"We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ∼30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments." In short, the paper proposes a data selection method (and a way to make annotation more efficient) that starts from a very small amount of data.
The repository is GitHub – KYLN24/CritiQ: Repository of the paper "CritiQ: Mining Data Quality Criteria from Human Preferences". Code for CritiQ Flow & Training CritiQ Scorer.
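As a rough illustration of the idea of scoring candidate quality criteria against a handful of human-annotated pairs, here is a minimal sketch. It is not the CritiQ implementation: in the paper the pairwise judgments come from LLM worker agents and a manager agent evolves the criteria, whereas the heuristic criteria, the sample pairs, the threshold, and every function name below are hypothetical stand-ins.

```python
from typing import Callable, Dict, List, Tuple

# A pair is (text_a, text_b, human_choice); human_choice is 0 if the
# annotator preferred text_a and 1 if they preferred text_b.
Pair = Tuple[str, str, int]
# A criterion judges a pair and returns 0 or 1. In CritiQ Flow this role is
# played by an LLM worker agent; the heuristics here are hypothetical.
Criterion = Callable[[str, str], int]


def prefers_longer(a: str, b: str) -> int:
    """Hypothetical criterion: the longer snippet is better."""
    return 0 if len(a) >= len(b) else 1


def prefers_more_comments(a: str, b: str) -> int:
    """Hypothetical criterion: the snippet with more '#' comments is better."""
    return 0 if a.count("#") >= b.count("#") else 1


def accuracy(criterion: Criterion, pairs: List[Pair]) -> float:
    """Fraction of human-annotated pairs on which the criterion agrees."""
    hits = sum(1 for a, b, choice in pairs if criterion(a, b) == choice)
    return hits / len(pairs)


def select_criteria(criteria: Dict[str, Criterion], pairs: List[Pair],
                    threshold: float) -> Dict[str, Criterion]:
    """Keep only criteria whose agreement with humans reaches the threshold."""
    return {name: fn for name, fn in criteria.items()
            if accuracy(fn, pairs) >= threshold}


def majority_vote(criteria: Dict[str, Criterion], a: str, b: str) -> int:
    """Combine the kept criteria into one pairwise judgment (ties favor a)."""
    votes = [fn(a, b) for fn in criteria.values()]
    return 1 if sum(votes) * 2 > len(votes) else 0


# Tiny made-up annotation set (the paper reports using about 30 pairs).
PAIRS: List[Pair] = [
    ("x = 1  # set x", "x=1", 0),
    ("y=2", "y = 2  # init y", 1),
    ("# a\n# b\nz=3", "z = 3 + 0 + 0 + 0 + 0", 0),
    ("print(1)  # show", "print(1)", 0),
]
CRITERIA = {"longer": prefers_longer, "more_comments": prefers_more_comments}

kept = select_criteria(CRITERIA, PAIRS, threshold=0.8)
print(sorted(kept))
```

On this toy data the length heuristic disagrees with the annotator on one of the four pairs, so only the comment-based criterion survives the 0.8 threshold.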

GTA1: GUI Test-time Scaling Agent

GTA1: GUI Test-time Scaling Agent [77.6]
This paper examines two challenges with GTA1, a GUI test-time scaling agent. First, it proposes a test-time scaling method for selecting the most appropriate action proposal. Second, it proposes a model that improves accuracy when grounding the selected action proposal to the corresponding visual element.
Reference translation of the paper (metadata) (Tue, 08 Jul 2025 08:52:18 GMT)
A GUI agent proposed by Salesforce Research; it claims SoTA on OSWorld and other benchmarks.
The techniques used are "i) test-time scaling for planning, which introduces a scaling strategy during inference to effectively handle planning ambiguity in complex GUI environments; ii) grounding model training, filtering out training samples with annotation errors to improve supervision quality, and optimizing a grounding model using RL (e.g., GRPO) to directly predict coordinates without relying on any intermediate “thinking” (i.e., CoT reasoning) on the derived data." The authors post-train UI-TARS-1.5-7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct; perhaps this task really is too hard without this kind of tuning...
The repository is GitHub – Yan98/GTA1
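The test-time scaling step described above (sample several action proposals at inference time, then pick one) can be sketched as a generic best-of-N selection. This is only an illustration under stated assumptions: `select_action`, the toy proposer, and the toy judge are hypothetical names, and the real system relies on model-based proposal and judging rather than the string matching used here.

```python
import random
from typing import Callable, List


def select_action(propose: Callable[[random.Random], str],
                  judge: Callable[[str], float],
                  num_samples: int = 8,
                  seed: int = 0) -> str:
    """Sample num_samples candidate actions and return the top-scored one."""
    rng = random.Random(seed)
    candidates: List[str] = [propose(rng) for _ in range(num_samples)]
    # max() returns the first candidate with the highest judge score.
    return max(candidates, key=judge)


# Hypothetical stand-ins for the planner and the judge.
ACTIONS = ["click('Save')", "click('Cancel')", "type('report.txt')", "scroll(down)"]


def toy_propose(rng: random.Random) -> str:
    """Pretend planner: proposes a random action from a fixed menu."""
    return rng.choice(ACTIONS)


def toy_judge(action: str) -> float:
    """Pretend judge: only the 'Save' click advances the task."""
    return 1.0 if action == "click('Save')" else 0.0


best = select_action(toy_propose, toy_judge, num_samples=16)
print(best)
```

Sampling more candidates raises the chance that at least one good proposal is in the pool, at the cost of more inference calls per step.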

Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop

Benchmarking Generalizable Bimanual Manipulation: RoboTwin Dual-Arm Collaboration Challenge at CVPR 2025 MEIS Workshop [120.3]
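The RoboTwin Dual-Arm Collaboration Challenge was held at the 2nd MEIS Workshop at CVPR 2025. Competitors tackled 17 dual-arm manipulation tasks covering rigid, deformable, and tactile-based scenarios. The report outlines the competition setup, task design, evaluation methodology, key findings, and future directions.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 17:56:41 GMT)
An overview of the "RoboTwin Dual-Arm Collaboration Challenge at the 2nd MEIS Workshop, CVPR 2025".
The project site is RoboTwin Dual-Arm Collaboration Challenge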

PresentAgent: Multimodal Agent for Presentation Video Generation

PresentAgent: Multimodal Agent for Presentation Video Generation [30.3]
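We present PresentAgent, a multimodal agent that transforms long-form documents into narrated presentation videos. To achieve this integration, PresentAgent employs a modular pipeline that segments the input document, plans, and renders slide-style visual frames. Given the complexity of evaluating such multimodal outputs, we introduce PresentEval, a unified evaluation framework powered by vision-language models.
Reference translation of the paper (metadata) (Sat, 05 Jul 2025 13:24:15 GMT)
An agent that creates presentation videos.
The repository is GitHub – AIGeeksGroup/PresentAgent: PresentAgent: Multimodal Agent for Presentation Video Generation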

Grok 4, Phi4-mini-Flash-Reasoning, SmolLM3, Kimi-K2, T5Gemma

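Many models were announced again last week, but the highlight is probably Grok 4, which claims strong results on a range of benchmarks (Grok 4 | xAI). Its 44.4% on Humanity’s Last Exam looks very strong.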

Among open models, there was a run of interesting releases: Phi4-mini-Flash-Reasoning, which has an interesting model architecture (Reasoning reimagined: Introducing Phi-4-mini-flash-reasoning | Microsoft Azure Blog; the paper is covered below); SmolLM3, a small model from Hugging Face (SmolLM3, GitHub – huggingface/smollm: Everything about the SmolLM and SmolVLM family of models); and Kimi-K2, a very high-performing model with an extreme MoE configuration of 1T total parameters and 32B active (GitHub – MoonshotAI/Kimi-K2: Kimi K2 is the large language model series developed by Moonshot AI team, Kimi K2). T5Gemma: A new collection of encoder-decoder Gemma models – Google Developers Blog also deserves attention. There are likely many tasks where architectures other than decoder-only show their strengths.
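To put the "1T total / 32B active" MoE figures in perspective, the quick calculation below works out the share of parameters used per token. The two numbers are the rounded values quoted in this post; the function name is just for illustration.

```python
# Rounded figures quoted in the post: about 1T total, about 32B active.
TOTAL_PARAMS = 1_000_000_000_000
ACTIVE_PARAMS = 32_000_000_000


def active_fraction(active: int, total: int) -> float:
    """Share of the model's parameters that take part in one forward pass."""
    if total <= 0:
        raise ValueError("total must be positive")
    return active / total


fraction = active_fraction(ACTIVE_PARAMS, TOTAL_PARAMS)
print(f"{fraction:.1%} of parameters are active per token")  # 3.2%
```

In other words, roughly one parameter in thirty-one is exercised for any given token, which is what makes the configuration "extreme".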

Encoder-Decoder Gemma: Improving the Quality-Efficiency Trade-Off via Adaptation [52.2]
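We study a novel problem: adapting decoder-only large language models to encoder-decoder models. We argue that adaptation not only inherits the capabilities of decoder-only LLMs but also reduces the demand for computation. Under a similar inference budget, encoder-decoder LLMs achieve comparable (often better) pretraining performance and substantially better fine-tuning performance than their decoder-only counterparts.
Reference translation of the paper (metadata) (Tue, 08 Apr 2025 17:13:41 GMT)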

Decoder-Hybrid-Decoder Architecture for Efficient Reasoning with Long Generation [129.5]
We introduce the Gated Memory Unit (GMU), a simple yet effective mechanism for efficient memory sharing across layers. The result is a decoder-hybrid-decoder architecture that incorporates GMUs to share memory readout states from a Samba-based self-decoder.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 07:27:00 GMT)
The paper behind Phi4-mini-Flash-Reasoning.
"Our decoder-hybrid-decoder architecture taking Samba [RLL+25] as the self-decoder. Gated Memory Units (GMUs) are interleaved with the cross-attention layers in the cross-decoder to reduce the decoding complexity. As in YOCO [SDZ+24], the full attention layer only need to compute the KV cache during prefilling with the self-decoder, leading to linear computation complexity for the prefill stage." The architecture is computationally favorable and appears well suited to LRMs.

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities [1584.5]
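Gemini 2.5 Pro is our most capable model, achieving SoTA performance on frontier coding and reasoning benchmarks. Gemini 2.5 Flash provides excellent reasoning abilities at a fraction of the compute and latency requirements. Gemini 2.0 Flash and Flash-Lite provide high performance at low latency and cost.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 17:36:04 GMT)
The Gemini 2.5 paper is also out. The number of co-authors is remarkable (more than 3,300).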

SciMaster: Towards General-Purpose Scientific AI Agents, Part I. X-Master as Foundation: Can We Lead on Humanity’s Last Exam? [47.2]
This paper introduces X-Master, a tool-augmented reasoning agent that emulates human researchers. X-Masters sets a new record on Humanity’s Last Exam with a score of 32.1%.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 17:50:52 GMT)
Reports reaching 32.1% on Humanity’s Last Exam with an agentic approach plus DeepSeek-R1-0528. It would be interesting to see the score with Grok 4 as the base model.
The repository is GitHub – sjtu-sai-agents/X-Master: Official implementation of X-Master, a general-purpose tool-augmented reasoning agent.

Frontier LLMs Still Struggle with Simple Reasoning Tasks

Frontier LLMs Still Struggle with Simple Reasoning Tasks [53.5]
This work studies the performance of frontier language models on a broad set of "easy" reasoning problems. The authors create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning. They show that even state-of-the-art thinking models consistently fail on such problems, and for similar reasons.
Reference translation of the paper (metadata) (Wed, 09 Jul 2025 22:22:49 GMT)
"By extending previous work in the literature, we create a suite of procedurally generated simple reasoning tasks, including counting, first-order logic, proof trees, and travel planning, with changeable parameters (such as document length or the number of variables in a math problem) that can arbitrarily increase the amount of computation required to produce the answer while preserving the fundamental difficulty. While previous work showed that traditional, non-thinking models can be made to fail on such problems, we demonstrate that even state-of-the-art thinking models consistently fail on such problems and for similar reasons (e.g., statistical shortcuts, errors in intermediate steps, and difficulties in processing long contexts)." The paper builds tasks that are simple yet hard for LLMs/LRMs to solve.
The paper points out that "Similarly to other recent works, our results suggest that LLMs mimic training data rather than performing true reasoning, making it relatively easy to find out-of-distribution problems where the models fail, and this problem is also present at the newest thinking models. This suggests that users remain careful when relying on the output of LLMs." As I also felt with CatAttack below, it is worth keeping in mind that the abilities of LLMs/LRMs differ considerably from those of humans.
The repository is reportedly https://github.com/google-deepmind/unpuzzles_and_simple_reasoning/
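To make "procedurally generated tasks with changeable parameters" concrete, here is a minimal sketch of a counting task whose length can be scaled arbitrarily while the question itself stays trivial. It is not taken from the paper's repository; the function name, vocabulary, and prompt wording are assumptions made for illustration.

```python
import random
from typing import Tuple


def make_counting_task(length: int, target: str = "cat",
                       seed: int = 0) -> Tuple[str, int]:
    """Build a word-counting prompt of the given length and its true answer.

    Increasing `length` increases the amount of work needed to answer,
    but the task remains conceptually just as simple.
    """
    rng = random.Random(seed)
    vocab = [target, "dog", "bird", "fish"]  # hypothetical vocabulary
    words = [rng.choice(vocab) for _ in range(length)]
    answer = words.count(target)  # ground truth computed, not guessed
    question = f"How many times does the word '{target}' appear in this list?"
    prompt = question + "\n" + " ".join(words)
    return prompt, answer


# Same task family at two difficulty settings.
short_prompt, short_answer = make_counting_task(length=20)
long_prompt, long_answer = make_counting_task(length=2000)
print(short_prompt)
print("expected answer:", short_answer)
```

Because the answer is computed by the generator, model outputs can be graded automatically at any length.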

Cats Confuse Reasoning LLM: Query Agnostic Adversarial Triggers for Reasoning Models [25.1]
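We investigate the robustness of reasoning models trained for step-by-step problem solving by introducing query-agnostic adversarial triggers. We propose CatAttack, an automated iterative attack pipeline that generates triggers on a weaker, cheaper proxy model. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs.
Reference translation of the paper (metadata) (Mon, 03 Mar 2025 18:10:54 GMT)
"For example, appending, Interesting fact: cats sleep most of their lives, to any math problem leads to more than doubling the chances of a model getting the answer wrong. Our findings highlight critical vulnerabilities in reasoning models, revealing that even state-of-the-art models remain susceptible to subtle adversarial inputs, raising security and reliability concerns." An amusing attack. On the other hand, there are also reports that noisy (irrelevant) examples help improve RAG, so the behavior remains a real mystery.
The repository is collinear-ai/cat-attack-adversarial-triggers · Datasets at Hugging Face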
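A query-agnostic trigger is simply a fixed string appended to any input, independent of the question. The sketch below shows only that appending step, using the trigger sentence quoted above; the function names and sample problems are hypothetical, and the paper's pipeline for discovering triggers on a proxy model is not reproduced here.

```python
from typing import List, Tuple

# Trigger text as quoted in the abstract.
TRIGGER = "Interesting fact: cats sleep most of their lives"


def add_trigger(problem: str, trigger: str = TRIGGER) -> str:
    """Append the same trigger to a problem, whatever the problem says."""
    return f"{problem.rstrip()} {trigger}"


def build_eval_pairs(problems: List[str]) -> List[Tuple[str, str]]:
    """Pair each clean problem with its perturbed version for A/B testing."""
    return [(p, add_trigger(p)) for p in problems]


# Hypothetical sample problems.
PROBLEMS = ["What is 17 * 6?", "Solve for x: 2x + 3 = 11."]
for clean, perturbed in build_eval_pairs(PROBLEMS):
    print(clean, "->", perturbed)
```

Comparing a model's error rate on the clean and perturbed versions of the same problems is one way to measure how much the trigger hurts.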

The Power of Noise: Redefining Retrieval for RAG Systems [19.4]
Retrieval-Augmented Generation (RAG) has emerged as a way to extend large language models beyond their pre-trained knowledge. We focus on the type of passages that the IR system in a RAG solution should retrieve.
Reference translation of the paper (metadata) (Wed, 1 May 2024 08:15:07 GMT)
"Finally, and even more surprisingly, random, noisy documents are actually helpful in increasing the accuracy of these systems when correctly positioned within a prompt." It is interesting that irrelevant examples can help.

MemOS: A Memory OS for AI System, MIRIX: Multi-Agent Memory System for LLM-Based Agents

Memory-related research aimed at problems that are hard for RAG is very active.

MemOS: A Memory OS for AI System [115.3]
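Large language models (LLMs) have become an essential foundation for artificial general intelligence (AGI). Existing models rely mainly on static parameters and short-lived contextual state, which limits their ability to track user preferences or update knowledge over long periods. MemOS is a memory operating system that treats memory as a manageable system resource.
Reference translation of the paper (metadata) (Fri, 04 Jul 2025 17:21:46 GMT)
An update to MemOS: An Operating System for Memory-Augmented Generation (MAG) in Large Language Models – Introducing the Latest arXiv Papers; memory for LLMs built with an agentic approach. The gains are large in areas that are not easy for ordinary RAG, such as handling temporal order. (That said, given "To ensure architectural parity, all methods are implemented over the same LLM backbone (GPT-4o-mini)", it is somewhat unclear whether GPT-4o mini is an adequate base model.)
The repository is GitHub – MemTensor/MemOS: MemOS (Preview) | Intelligence Begins with Memory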

MIRIX: Multi-Agent Memory System for LLM-Based Agents [7.1]
MIRIX is a modular, multi-agent memory system for language models. MIRIX goes beyond text to embrace rich visual and multimodal experiences. MIRIX sets a new performance standard for memory-augmented LLM agents.
Reference translation of the paper (metadata) (Thu, 10 Jul 2025 17:40:11 GMT)
This is also a memory management framework built with an agentic approach. Direct comparison with MemOS is difficult because the base models differ, but it claims higher performance than other systems.
The repository is GitHub – Mirix-AI/MIRIX: Mirix is a multi-agent personal assistant designed to track on-screen activities and answer user questions intelligently. By capturing real-time visual data and consolidating it into structured memories, Mirix transforms raw inputs into a rich knowledge base that adapts to your digital experiences.

Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions [19.5]
Agents with memory mechanisms are referred to as memory agents. In this paper, we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution. Existing datasets either rely on limited context lengths or are tailored to static, long-context settings such as book-based QA. Because no existing benchmark covers all four competencies, we introduce MemoryAgentBench, a new benchmark designed specifically for memory agents.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 17:59:54 GMT)
This one proposes a benchmark for agents with memory.
The paper states: "we identify four core competencies essential for memory agents: accurate retrieval, test-time learning, long-range understanding, and conflict resolution."
The observation in the results, "While Mem0 has demonstrated relatively strong performance on conversational tasks such as LOCOMO—where information density is comparatively low—it tends to perform poorly on benchmarks containing dense informational content, including RULER and ∞-Bench. For tasks emphasizing Time-to-Live (TTL) and Least Recently Used (LRU) retrieval, these limitations are often even more pronounced.", is interesting; it gives the impression that building a general-purpose, domain-agnostic structure is hard.
The repositories are ai-hyz/MemoryAgentBench · Datasets at Hugging Face and GitHub – HUST-AI-HYZ/MemoryAgentBench: Open source code for Paper: Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions
