arXiv – Page 16 – Introducing the Latest arXiv Papers

DeepAgent: A General Reasoning Agent with Scalable Toolsets

DeepAgent: A General Reasoning Agent with Scalable Toolsets [111.6]
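DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution. To address the challenges of long-horizon interactions, it introduces an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories. The authors also develop ToolPO, an end-to-end reinforcement learning strategy that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to tool invocation tokens.
Reference translation of the paper (metadata) (Fri, 24 Oct 2025 16:24:01 GMT)
An agent framework that supports tool use, among other capabilities. Its effectiveness is validated with QwQ-32B as the backbone.
The repository is GitHub – RUC-NLPIR/DeepAgent: 🛠️ DeepAgent: A General Reasoning Agent with Scalable Toolsets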

ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases

ImpossibleBench: Measuring LLMs’ Propensity of Exploiting Test Cases [58.4]
"Shortcuts" for completing tasks pose significant risks to the reliable evaluation and deployment of large language models. We introduce ImpossibleBench, a benchmark framework that measures LLM agents' propensity to exploit test cases. As a practical framework, ImpossibleBench is not just an evaluation but a versatile tool.
Reference translation of the paper (metadata) (Thu, 23 Oct 2025 06:58:32 GMT)
A benchmark for measuring cheating: "we introduce ImpossibleBench, a benchmark framework that systematically measures LLM agents’ propensity to exploit test cases." The observation that "frontier models frequently cheat when faced with these impossible tasks, and stronger models generally exhibit higher cheating rates." is interesting and also matches intuition.
The repository is GitHub – safety-research/impossiblebench

ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows

ResearchGPT: Benchmarking and Training LLMs for End-to-End Computer Science Research Workflows [109.3]
CS-54k is a high-quality corpus of Q&A pairs in computer science. CS-4k is a benchmark for evaluating AI's ability to assist scientific research. CS-50k is a large-scale training dataset.
Reference translation of the paper (metadata) (Thu, 23 Oct 2025 07:07:35 GMT)
A benchmark described as follows: "We introduce CS-4k, the first benchmark that systematically evaluates the end-to-end research workflow in computer science through open-ended scientific question answering, offering a rigorous yardstick to assess LLMs’ ability to assist scientific research." The authors also argue that post-training on this data is effective.
The repository is GitHub – wph6/ResearchGPT: Official repo for ReseachGPT

Human-AI Interactions: Cognitive, Behavioral, and Emotional Impacts

Human-AI Interactions: Cognitive, Behavioral, and Emotional Impacts [0.0]
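The paper highlights potential risks such as overreliance, cognitive offloading, social and emotional manipulation, and the subtle degradation of human agency and judgment. Observations suggest that AI can substantially enhance memory, creativity, and engagement, but it also brings risks such as reduced critical thinking, skill erosion, and increased anxiety. The paper aims to highlight gaps in longitudinal research and evaluation frameworks for balancing emerging human-centered risks and benefits, and to make the case for responsible, context-aware AI design.
Reference translation of the paper (metadata) (Mon, 20 Oct 2025 17:06:46 GMT)
A survey on how humans and AI interact. It presents many cases that may warrant caution on the risk side.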

How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations

How Do AI Agents Do Human Work? Comparing AI and Human Workflows Across Diverse Occupations [112.6]
The paper examines how agents do human work by presenting the first direct comparison of human and agent workers. Agents were found to deliver results 88.3% faster and at 90.4-96.2% lower cost.
Reference translation of the paper (metadata) (Sun, 26 Oct 2025 18:10:22 GMT)
A comparison of humans and agents. Various problems are also pointed out, but the result that "Compared to an average human worker, agents deliver work 88.3–96.6% faster and at 90.4–96.2% lower costs. Our induced workflows naturally suggest a division of labor: readily programmable steps can be delegated to agents for efficiency, while humans handle the steps where agents fall short." is somewhat surprising.
- "One quarter of human activities we studied involve AI tools, with most used for augmentation purposes: integrating AI into existing workflows with minimal disruption, while improving efficiency by 24.3%. In contrast, AI automation markedly reshapes workflows and slows human work by 17.7%, largely due to additional time spent on verification and debugging (Figure 5)." is about what one would expect, though.
A toolkit has been released: GitHub – zorazrw/workflow-induction-toolkit: A toolkit to induce interpretable workflows from raw computer-use activities.

Remote Labor Index: Measuring AI Automation of Remote Work [46.5]
AI has made rapid progress on research-oriented knowledge and reasoning benchmarks, but it remains unclear how these gains translate into economic value and automation. To measure this, we introduce the Remote Labor Index (RLI), a broad multi-sector benchmark comprising real-world, economically valuable projects.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 17:58:04 GMT)
This one, by contrast, points out that "RLI establishes an economically grounded measure of AI automation capacity, with 240 projects spanning 23 domains of digital freelance work, each anchored in demonstrated market value. Frontier AI agents perform near the floor on RLI, achieving an automation rate of less than 3%, revealing a stark gap between progress on computer use evaluations and the ability to perform real and economically valuable work."

Evaluating Long-Term Memory for Long-Context Question Answering

Evaluating Long-Term Memory for Long-Context Question Answering [100.1]
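The paper systematically evaluates memory-augmented methods using LoCoMo, a benchmark of synthetic long-context dialogues annotated for question-answering tasks. The results show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy.
Reference translation of the paper (metadata) (Mon, 27 Oct 2025 18:03:50 GMT)
On the effectiveness of memory for long contexts: "Our findings show that memory-augmented approaches reduce token usage by over 90% while maintaining competitive accuracy. Memory architecture complexity should scale with model capability, with small foundation models benefitting most from RAG, and strong instruction-tuned reasoning model gaining from episodic learning through reflections and more complex agentic semantic memory." The influence of model size and the task-dependent performance gap against full context are also interesting.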

Co-Evolving Latent Action World Models, SPICE : Self-Play In Corpus Environments Improves Reasoning, Critique-RL, Parrot

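Last week saw several papers that co-evolve two different components to improve performance. GANs are the best-known framework of this kind, but the approach also shows up often in the LLM-based era, which is very interesting.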

Co-Evolving Latent Action World Models [57.5]
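Adapting pre-trained video models into controllable world models via latent actions is a promising step toward creating generalist world models. This paper proposes CoLA-World, the first framework to realize this synergistic paradigm. The world model acts as a knowledgeable tutor, providing gradients that shape a high-quality LAM.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 12:28:40 GMT)
"We propose CoLA-World, the first framework that successfully enables joint training of a latent action model with a pre-trained video-generation-based world model." That is, the latent action model (LAM) and the world model are trained together.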

SPICE: Self-Play In Corpus Environments Improves Reasoning [58.8]
SPICE is a reinforcement learning framework in which a single model acts in two roles. The Challenger mines documents from a large corpus to generate diverse reasoning tasks. The analysis shows that document grounding is a key ingredient in SPICE for continuously generating increasingly challenging goals.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 17:46:16 GMT)
A reinforcement learning framework that uses a Challenger and a Reasoner: "SPICE is a self-play framework where a single LLM, πθ, acts in two roles: a Challenger (role = C), which poses difficult questions, and a Reasoner (role = R), which tries to correctly answer such questions. The Challenger uses a raw document (which does not contain existing questions or labels) from a corpus to generate a (q, a∗) pair."
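
The two-role loop quoted above can be pictured with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: `ToyPolicy` replaces the single LLM, the Challenger builds a (q, a∗) pair by masking one word of a raw document, and a dictionary update stands in for the RL update.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Task:
    question: str
    answer: str


@dataclass
class ToyPolicy:
    """Hypothetical stand-in for the single model that plays both roles."""
    seed: int = 0
    memory: Dict[str, str] = field(default_factory=dict)

    def __post_init__(self) -> None:
        self.rng = random.Random(self.seed)

    def pose(self, document: str) -> Task:
        # Challenger role: turn a raw document into a (q, a*) pair by masking one word.
        words = document.split()
        idx = self.rng.randrange(len(words))
        question = " ".join("____" if i == idx else w for i, w in enumerate(words))
        return Task(question=question, answer=words[idx])

    def answer(self, question: str) -> str:
        # Reasoner role: sees only the question, never the source document.
        return self.memory.get(question, "unknown")


def self_play_round(policy: ToyPolicy, corpus: List[str]) -> bool:
    document = policy.rng.choice(corpus)
    task = policy.pose(document)
    correct = policy.answer(task.question) == task.answer
    # Toy "learning" step standing in for a reinforcement learning update (assumption).
    policy.memory[task.question] = task.answer
    return correct


policy = ToyPolicy(seed=0)
results = [self_play_round(policy, ["alpha beta gamma"]) for _ in range(10)]
print(results)
```

The first round always fails because the toy Reasoner starts with no knowledge; later rounds succeed once a question repeats. The real framework trains a single LLM in both roles, which this sketch does not attempt.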

Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.6]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. The method builds on a two-stage paradigm in which the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 11:37:01 GMT)
A two-stage scheme for strengthening the critic model: "In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic’s helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements." (The actor side is not updated, so this differs from the other papers here.)
The repository is GitHub – WooooDyy/Critique-RL

Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning [69.0]
Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. We propose Parrot, a novel training pipeline for mathematical problems.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 09:23:17 GMT)
Strengthens both Natural language chain-of-thought (N-CoT) and Program chain-of-thought (P-CoT), building on three subtasks: "The pipeline comprises three target-designed subtasks: Information Retrieval trains the model to concentrate on key information within problem. P-CoT Reasoning utilizes the information to generate variable well-defined code solutions. Paradigm Conversion enhances N-CoT with concise P-CoT and its intermediate outputs."

MiniMax M2, Kimi-Linear, Ling-V2, Ouro, Emu3.5, gpt-oss-safeguard

Last week brought a lot of news about open models; among them, MiniMax-M2 and Kimi-Linear deserve particular attention. The latter is also notably efficient. It is easy to confuse with last week's Ring, but Ling-V2 is also a strong model (according to the report, "This report focuses on three reflex-grade non-thinking (instruct) models in the Ling 2.0 family—Ling-mini-2.0, Ling-flash-2.0, and Ling-1T. These models emphasize general reasoning and instruction-following capability, while the Ring series (Ling-Team, 2025), built upon the same Ling 2.0 base, extends toward deep thinking models."). The small models Ouro-2.6B and Ouro-2.6B-Thinking were also interesting.

Separately from the above, it is good to see strong models being released, such as the multimodal Emu3.5 and gpt-oss-safeguard for safety classification tasks. (The intended use of the last one seems quite different from the others, though.)

Kimi Linear: An Expressive, Efficient Attention Architecture [75.9]
Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons. At its core is Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet. We show that Kimi Linear can serve as a drop-in replacement for full attention with better performance and efficiency.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 16:59:43 GMT)
The following is used in a hybrid configuration: "At its core lies Kimi Delta Attention (KDA), a hardware-efficient linear attention module that extends Gated DeltaNet [111] with a finer-grained gating mechanism. While GDN, similar to Mamba2 [16], employs a coarse head-wise forget gate, KDA introduces a channel-wise variant in which each feature dimension maintains an independent forgetting rate, akin to Gated Linear Attention (GLA) [114]. This fine-grained design enables more precise regulation of the finite-state RNN memory, unlocking the potential of RNN-style models within hybrid architectures."
GitHub – MoonshotAI/Kimi-Linear
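
The difference between a head-wise and a channel-wise forget gate, as described in the quote above, can be illustrated with a minimal sketch of the decay step alone. This is not KDA itself: the delta-rule update and the hardware-efficient kernels are omitted, and the function names, the toy state, and the choice to treat each row of the state as one feature dimension are assumptions made for illustration.

```python
from typing import List

Matrix = List[List[float]]


def decay_head_wise(state: Matrix, alpha: float) -> Matrix:
    """Coarse gate: one forget rate shared by every feature dimension of the head."""
    return [[alpha * value for value in row] for row in state]


def decay_channel_wise(state: Matrix, alphas: List[float]) -> Matrix:
    """Fine-grained gate: row i (one feature dimension) decays at its own rate alphas[i]."""
    if len(alphas) != len(state):
        raise ValueError("need exactly one forget rate per feature dimension")
    return [[alpha * value for value in row] for alpha, row in zip(alphas, state)]


# Toy memory state: 3 feature dimensions (rows) x 2 value dimensions (columns).
state = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]
head_wise = decay_head_wise(state, 0.5)
channel_wise = decay_channel_wise(state, [1.0, 0.5, 0.0])
print(head_wise)     # every row shrinks by the same factor
print(channel_wise)  # row 0 is kept, row 1 is halved, row 2 is erased
```

With a single scalar, the memory can only forget everything at the same pace; with per-channel rates, some feature dimensions can be retained while others are cleared, which is the "more precise regulation" the quote refers to.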

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation [149.0]
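Ling 2.0 is a series of reasoning-oriented language foundation models built on the principle that every activation boosts reasoning capability. Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency guided by empirical scaling laws. The series includes three non-thinking models: Ling-mini-2.0, Ling-flash-2.0, and Ling-1T.
Reference translation of the paper (metadata) (Sat, 25 Oct 2025 01:51:37 GMT)
Unlike Ring-1T, which focuses on long reasoning, this work focuses on general reasoning and instruction-following ability.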
GitHub – inclusionAI/Ling-V2: Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI.

Scaling Latent Reasoning via Looped Language Models [109.6]
We present and open-source Ouro, a family of pre-trained Looped Language Models (LoopLM). Ouro builds reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 17:45:42 GMT)
A report on building models with the Looped Language Model (LoopLM) architecture. "we introduced Ouro, a family of Looped Language Models that demonstrate exceptional parameter efficiency by integrating iterative computation and adaptive depth directly into pre-training on 7.7T tokens. Our 1.4B and 2.6B models consistently match or exceed the performance of 4B and 8B standard transformers, showcasing a 2-3× efficiency gain." The efficiency is very high.
Ouro: Looped Language Models

Parallel Loop Transformer for Efficient Test-Time Computation Scaling [34.8]
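Large language models (LLMs) are powerful but often too slow and costly for real-world use during inference. Looped transformers save parameters by reusing the same weights across multiple computational steps. However, the loops run one after another, so inference latency and memory requirements grow with each added loop.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 15:35:50 GMT)
This one is the parallel variant, the Parallel Loop Transformer (PLT).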
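
The trade-off described in the abstract (weights are reused across loops, but loops run sequentially) can be seen in a toy sketch. `SharedBlock` is a hypothetical two-parameter stand-in for a transformer block, not the architecture of Ouro or PLT; the call counter serves as a rough proxy for sequential latency.

```python
from typing import List


class SharedBlock:
    """Hypothetical stand-in for one transformer block with two parameters."""

    def __init__(self, weight: float, bias: float) -> None:
        self.weight = weight
        self.bias = bias
        self.calls = 0  # counts sequential applications (a rough latency proxy)

    @property
    def num_params(self) -> int:
        return 2

    def __call__(self, x: List[float]) -> List[float]:
        self.calls += 1
        # Affine map followed by ReLU.
        return [max(0.0, self.weight * v + self.bias) for v in x]


def looped_forward(x: List[float], block: SharedBlock, n_loops: int) -> List[float]:
    # The same block (same weights) is applied n_loops times, one after another.
    for _ in range(n_loops):
        x = block(x)
    return x


block = SharedBlock(weight=2.0, bias=0.0)
output = looped_forward([1.0, -1.0], block, n_loops=3)
print(output)            # [8.0, 0.0]
print(block.num_params)  # 2, regardless of the number of loops
print(block.calls)       # 3 sequential steps
```

Adding loops increases effective depth without adding parameters, but each extra loop adds one more sequential step, which is the latency cost the abstract points to.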

Emu3.5: Native Multimodal Models are World Learners [65.9]
Emu3.5 is a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of interleaved vision-language data. It exhibits generalizable world-modeling abilities, enabling consistent world exploration and open-world embodied manipulation.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 15:11:16 GMT)
The latest in the Emu series (Emu3: Next-Token Prediction is All You Need – Introducing the Latest arXiv Papers). According to the paper, "Emu3.5 further exhibits generalizable world-modeling abilities encompassing world exploration and embodied manipulation, enabling controllable interaction, free-form navigation, and dynamic scene simulation across both real and imagined environments. We carefully evaluate these new capabilities and demonstrate clear superiority of Emu3.5, a single 32B unified model, over the closed-source Gemini 2.5 Flash Image [91]."
emu.world/pages/web/landingPage, GitHub – baaivision/Emu3.5: Native Multimodal Models are World Learners

The Era of Agentic Organization: Learning to Organize with Language Models

The Era of Agentic Organization: Learning to Organize with Language Models [107.4]
We introduce asynchronous thinking (AsyncThink) as a new paradigm for reasoning with large language models. In experiments, AsyncThink achieves 28% lower inference latency than parallel thinking. AsyncThink also generalizes its learned asynchronous thinking capability, effectively handling unseen tasks without additional training.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 16:25:10 GMT)
A framework that can process work asynchronously, much like a multi-agent system. "In this work, we introduce asynchronous thinking (AsyncThink) as a new paradigm for reasoning with large language models, with the goal of learning to organize the internal thinking into concurrently executable structures. Specifically, we propose a thinking protocol where an LLM plays both roles: an organizer that dynamically structures the process through Fork and Join actions, and workers that execute sub-queries and return intermediate knowledge or results."
The project site is Advancing AI for Humanity
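
As a loose analogy for the Fork and Join actions quoted above, the organizer/worker pattern can be sketched with Python's `asyncio`. This only illustrates the control flow: in the paper a single LLM plays both roles through a thinking protocol, whereas the `worker` and `organizer` coroutines and their string outputs below are hypothetical.

```python
import asyncio
from typing import List


async def worker(sub_query: str) -> str:
    """Hypothetical worker: executes one sub-query and returns an intermediate result."""
    await asyncio.sleep(0)  # yield control so other workers can run concurrently
    return f"result({sub_query})"


async def organizer(query: str, sub_queries: List[str]) -> str:
    # Fork: launch one concurrent task per sub-query.
    tasks = [asyncio.create_task(worker(q)) for q in sub_queries]
    # The organizer could continue its own reasoning here while workers run.
    # Join: wait for each forked task and collect its result in order.
    results = [await task for task in tasks]
    return f"{query}: " + "; ".join(results)


final_answer = asyncio.run(organizer("total", ["a", "b"]))
print(final_answer)  # total: result(a); result(b)
```

The point of the pattern is that sub-queries do not block one another between Fork and Join, which is where a latency advantage over strictly sequential reasoning would come from.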

A Survey of AI Scientists: Surveying the automatic Scientists and Research

A Survey of AI Scientists: Surveying the automatic Scientists and Research [34.9]
Artificial intelligence is undergoing a major shift from a computational instrument to an autonomous originator of scientific knowledge. This survey introduces a unified six-stage methodological framework that decomposes the end-to-end scientific process into literature review, idea generation, experimental preparation, experimental execution, scientific writing, and paper generation.
Reference translation of the paper (metadata) (Mon, 27 Oct 2025 06:13:21 GMT)
A survey on the use of AI in science: "This survey provides a systematic and comprehensive synthesis of this emerging domain by introducing a unified, six-stage methodological framework that deconstructs the scientific process into: Literature Review, Idea Generation, Experimental Preparation, Experimental Execution, Scientific Writing, and Paper Generation. Through this analytical lens, we systematically map and analyze dozens of seminal works from 2022 to late 2025, revealing a clear three-phase evolutionary trajectory."
The repository is GitHub – Mr-Tieguigui/Survey-for-AI-Scientist: A comprehensive survey for AI Scientist.
