DeepResearch – Introducing the Latest arXiv Papers

Step-DeepResearch Technical Report

Step-DeepResearch Technical Report [90.5]
We introduce Step-DeepResearch, a cost-effective, end-to-end agent. We propose a data synthesis strategy based on atomic capabilities to strengthen planning and report writing. To bridge the evaluation gap in the Chinese domain, we construct ADR-Bench for realistic deep research scenarios.
Reference translation of the paper (metadata) (Tue, 23 Dec 2025 16:32:27 GMT)
A deep research agent and an evaluation benchmark proposed by StepFun. The authors claim strong performance: "Experimental results demonstrate that Step-DeepResearch, with only 32B parameters, achieves a high score of 61.4% on the Scale AI Research Rubrics. In expert human evaluations on ADR-Bench, its Elo score significantly outperforms comparable models and rivals state-of-the-art closed-source models such as OpenAI DeepResearch and Gemini DeepResearch." Running it requires an API connection, though, which makes one wonder whether this is not effectively closed as well.
The repository is GitHub – stepfun-ai/StepDeepResearch: Step-DeepResearch
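
For readers unfamiliar with Elo scores such as the one reported for the ADR-Bench human evaluation, here is a minimal sketch of the standard Elo expected-score and update rule. The K-factor of 32 and the example ratings are illustrative assumptions, not values from the report, and the report's actual rating procedure may differ.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update(rating_a: float, rating_b: float, score_a: float, k: float = 32.0):
    """Return the new (rating_a, rating_b) after one pairwise comparison.

    score_a is 1.0 if A is preferred, 0.0 if B is preferred, 0.5 for a tie.
    k is an assumed K-factor chosen only for illustration.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    return rating_a + delta, rating_b - delta


if __name__ == "__main__":
    # Two systems start level; an evaluator prefers system A once.
    print(update(1500.0, 1500.0, score_a=1.0))
```

Because the expected score depends only on the rating difference, a win against a much stronger opponent moves the ratings more than a win against an equal one.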

Deep Research: A Systematic Survey

Deep Research: A Systematic Survey [118.8]
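Deep Research (DR) aims to combine the reasoning capabilities of large language models with external tools such as search engines. This survey presents a comprehensive and systematic overview of deep research systems.
Reference translation of the paper (metadata) (Mon, 24 Nov 2025 15:28:28 GMT)
A survey on Deep Research. It is broad in scope and covers related work as well. The list of cited papers shows (unsurprisingly) how rapidly the field has taken off since 2025.
The repository is GitHub – mangopy/Deep-Research-Survey: A Systematic Survey of Deep Research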

How Far Are We from Genuinely Useful Deep Research Agents?

How Far Are We from Genuinely Useful Deep Research Agents? [48.6]
Deep Research Agents (DRAs) aim to automatically produce analyst-level reports through iterative information retrieval and synthesis. Current benchmarks for report synthesis suffer from task complexity and subjective metrics. We present FINDER, an enhanced benchmark consisting of 100 human-curated research tasks.
Reference translation of the paper (metadata) (Mon, 01 Dec 2025 17:58:59 GMT)
A benchmark proposal: "Fine-grained DEep-Research bench (FINDER), a fine-grained benchmark designed to evaluate DRAs in a more comprehensive manner. Unlike existing benchmarks, DEFT is built upon 100 expert-curated research tasks with 419 detailed checklist items that guide the structure, analytical depth, and citation integrity of generated reports."
The repository is GitHub – OPPO-PersonalAI/FINDER_DEFT: Official implementation for paper “How Far Are We from Genuinely Useful Deep Research Agents?”
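
The checklist-based evaluation quoted above can be pictured with a small sketch. The data layout, the function names `checklist_pass_rate` and `overall_pass_rate`, and the unweighted averaging are all hypothetical; the actual FINDER scoring protocol is defined in the paper and its repository.

```python
from typing import Dict, List


def checklist_pass_rate(results: Dict[str, List[bool]]) -> Dict[str, float]:
    """Per-task fraction of checklist items that a generated report satisfied.

    Tasks with an empty checklist are skipped to avoid division by zero.
    """
    return {
        task: sum(items) / len(items)
        for task, items in results.items()
        if items
    }


def overall_pass_rate(results: Dict[str, List[bool]]) -> float:
    """Fraction of all checklist items satisfied, pooled across tasks."""
    total = sum(len(items) for items in results.values())
    passed = sum(sum(items) for items in results.values())
    return passed / total if total else 0.0


if __name__ == "__main__":
    # Hypothetical judgments: True means the report met that checklist item.
    judged = {
        "task_001": [True, True, False, True],
        "task_002": [True, False],
    }
    print(checklist_pass_rate(judged))
    print(overall_pass_rate(judged))
```

Pooling items and averaging per-task rates give different numbers when tasks have different checklist lengths, so a benchmark has to fix one convention.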

Claude Opus 4.5, DeepSeekMath-V2, DR Tulu, Qwen3-VL, HunyuanVideo 1.5

Last week saw the announcement of Opus 4.5 (Introducing Claude Opus 4.5 \ Anthropic), and Anthropic's Claude delivered characteristically strong performance, particularly in code generation.

Among open models, the ones to watch are DeepSeekMath-V2 (deepseek-ai/DeepSeek-Math-V2 · Hugging Face), which is strong at mathematics; DR Tulu (DR Tulu: An open, end-to-end training recipe for long-form deep research | Ai2), which is strong at Deep Research; and the technical reports for Qwen3-VL and HunyuanVideo 1.5.

DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research [152.2]
Deep research models perform multi-step research to produce long-form, well-attributed answers. Most open deep research models are trained on short-form QA tasks via reinforcement learning with verifiable rewards. We develop Deep Research Tulu (DR Tulu-8B), the first open model directly trained for open-ended, long-form deep research.
Reference translation of the paper (metadata) (Wed, 26 Nov 2025 14:52:10 GMT)
A model specialized for DeepResearch: "In this paper, we introduce Deep Research Tulu (DR Tulu-8B), the first open model that is directly trained for open-ended, long-form deep research tasks. To address the challenge of verification in long-form tasks, DR Tulu is first finetuned on high-quality, naturally occurring user data, and then trained via a new method we call Reinforcement Learning with Evolving Rubrics (RLER), in which we construct and maintain rubrics that co-evolve with the policy model during training." The design of the reinforcement learning stage is also interesting.
The repository is GitHub – rlresearch/dr-tulu: Official repository for DR Tulu: Reinforcement Learning with Evolving Rubrics for Deep Research

Qwen3-VL Technical Report [153.4]
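Qwen3-VL is the most capable vision-language model in the Qwen series to date, achieving strong performance across a wide range of multimodal benchmarks. It supports interleaved contexts of up to 256K tokens, seamlessly integrating text, images, and video. Qwen3-VL delivers three core pillars: (i) markedly stronger pure-text understanding, surpassing comparable text-only backbones in several cases; (ii) robust long-context comprehension with a native 256K-token window for both text and interleaved multimodal inputs; (iii) advanced multimodal reasoning across single-image, multi-image, and video tasks.
Reference translation of the paper (metadata) (Wed, 26 Nov 2025 17:59:08 GMT)
The architecture is described as follows: "The Qwen3-VL framework integrates a vision encoder and a language model decoder to process multimodal inputs, including text, images, and video. The vision encoder is specifically designed to handle dynamic, native-resolution visual inputs, mapping them to visual tokens of variable length." Performance is comparable to commercial models and exceeds them in some areas.
The repository is GitHub – QwenLM/Qwen3-VL: Qwen3-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.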

HunyuanVideo 1.5 Technical Report [97.0]
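HunyuanVideo 1.5 is a lightweight yet powerful open-source video generation model. It achieves state-of-the-art visual quality and motion coherence with only 8.3 billion parameters. All open-source assets are available at https://github.com/Tencent-Hunyuan/HunyuanVideo-1.5.
Reference translation of the paper (metadata) (Tue, 25 Nov 2025 02:52:10 GMT)
An open model for video generation.
The repository is GitHub – Tencent-Hunyuan/HunyuanVideo-1.5: HunyuanVideo-1.5: A leading lightweight video generation model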

IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction [107.5]
IterResearch is a novel iterative deep research paradigm that reformulates long-horizon research as a Markov decision process. It achieves substantial improvements over existing open-source agents, averaging +14.5pp across six benchmarks. It also serves as an effective prompting strategy, improving frontier models by up to 19.2pp over ReAct on long-horizon tasks.
Reference translation of the paper (metadata) (Mon, 10 Nov 2025 17:30:08 GMT)
The paper improves on the usual approach to problems that need long processing, of which it says "The mono-contextual approach linearly accumulates all information into a single, ever-expanding context, leading to context suffocation and noise contamination." The proposed method: "IterResearch models deep research as an extended MDP with workspace reconstruction. Each round begins with a reconstructed workspace s_t containing the question, an evolving report M_t, and immediate context. The agent generates structured decisions d_t = (Think, Report, Action) and interacts with environment E. The transition function T reconstructs the workspace, maintaining the Markov property while preventing context bloat and enabling sustained reasoning and information-seeking." Even for AI (?), keeping information organized matters.
Scores improve on many benchmarks.
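
The workspace-reconstruction loop quoted above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the authors' implementation: the `Workspace` and `Decision` classes, the `run` function, the `"answer"` stop action, and the toy agent and environment are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Tuple


@dataclass(frozen=True)
class Workspace:
    """State s_t: the question, the evolving report M_t, and immediate context."""
    question: str
    report: str
    context: Optional[Tuple[str, str]] = None  # (last action, its observation)


@dataclass(frozen=True)
class Decision:
    """Structured decision d_t = (Think, Report, Action)."""
    think: str
    report: str
    action: str  # a tool call, or the assumed sentinel "answer" to stop


def run(question: str,
        agent: Callable[[Workspace], Decision],
        env: Callable[[str], str],
        max_rounds: int = 8) -> str:
    state = Workspace(question, report="")
    for _ in range(max_rounds):
        decision = agent(state)
        if decision.action == "answer":
            return decision.report
        observation = env(decision.action)
        # Transition: rebuild the workspace from the question, the updated
        # report, and only the latest interaction. Earlier thoughts and
        # observations are dropped, so the context does not keep growing.
        state = Workspace(question, decision.report, (decision.action, observation))
    return state.report


def toy_agent(state: Workspace) -> Decision:
    """Hypothetical stand-in for an LLM: search once, then answer."""
    if state.context is None:
        return Decision("need evidence", state.report, "search:capital of France")
    _, observation = state.context
    return Decision("enough evidence", state.report + observation, "answer")


def toy_env(action: str) -> str:
    """Hypothetical stand-in for a search tool."""
    return "Paris"


if __name__ == "__main__":
    print(run("What is the capital of France?", toy_agent, toy_env))
```

The point of the design is that the report is the only long-lived memory: anything the agent wants to keep across rounds has to be written into it.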

DeepAgent: A General Reasoning Agent with Scalable Toolsets

DeepAgent: A General Reasoning Agent with Scalable Toolsets [111.6]
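DeepAgent is an end-to-end deep reasoning agent that performs autonomous thinking, tool discovery, and action execution. To address the challenges of long-horizon interactions, we introduce an autonomous memory folding mechanism that compresses past interactions into structured episodic, working, and tool memories. We develop ToolPO, an end-to-end reinforcement learning strategy that leverages LLM-simulated APIs and applies tool-call advantage attribution to assign fine-grained credit to tool invocation tokens.
Reference translation of the paper (metadata) (Fri, 24 Oct 2025 16:24:01 GMT)
Introduces an agent framework that also supports tool use. Its effectiveness is validated with QwQ-32B as the backbone.
The repository is GitHub – RUC-NLPIR/DeepAgent: 🛠️ DeepAgent: A General Reasoning Agent with Scalable Toolsets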

Tongyi DeepResearch Technical Report

Tongyi DeepResearch Technical Report [109.8]
Tongyi DeepResearch is developed through an end-to-end training framework designed to incentivize autonomous deep research agency. It has 30.5 billion total parameters. We open-source the model, the framework, and complete solutions to empower the community.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 17:53:02 GMT)
"Tongyi DeepResearch establishes a new state-of-the-art with substantially fewer parameters, comprising a total of 30.5 billion parameters while activating only 3.3 billion per token, building upon the Qwen3-30B-A3B-Base model (Yang et al., 2025). Empirical evaluations on deep research benchmarks demonstrate the effectiveness of our agent." A DeepResearch agent built on a highly efficient model; the authors claim performance exceeding commercial offerings.
The project site is Tongyi DeepResearch: A New Era of Open-Source AI Researchers | Tongyi DeepResearch
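
To put the quoted parameter counts in perspective, the short calculation below gives the share of parameters active per token. The function name `active_fraction` is introduced here for illustration; the two figures come from the quotation above.

```python
def active_fraction(total_params_b: float, active_params_b: float) -> float:
    """Share of a sparse model's parameters that are activated per token."""
    return active_params_b / total_params_b


if __name__ == "__main__":
    # 30.5B total and 3.3B active per token, as quoted from the report.
    share = active_fraction(30.5, 3.3)
    print(f"{share:.1%}")  # prints 10.8%
```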

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis

FinDeepResearch: Evaluating Deep Research Agents in Rigorous Financial Analysis [110.6]
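HisRubric is a novel evaluation framework with a hierarchical analytical structure and a fine-grained grading rubric. FinDeepResearch is a benchmark consisting of 64 listed companies from 8 financial markets across 4 languages. We conduct extensive experiments on FinDeepResearch using 16 representative methods: 6 DR agents, 5 LLMs with both deep reasoning and search capabilities, and 5 LLMs with deep reasoning capabilities only.
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 17:21:56 GMT)
An evaluation of DeepResearch in the financial domain. o3 deepresearch performs well (with only a narrow gap to Grok4 and Gemini 2.5 Pro), but the authors note: "Our experiments suggest that even top-performing DR agents struggle to consistently balance a coherent analytical structure with factual accuracy. This imbalance remains the primary barrier to their deployment in high-stakes applications."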

InfoAgent: Advancing Autonomous Information-Seeking Agents

InfoAgent: Advancing Autonomous Information-Seeking Agents [143.2]
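This paper introduces InfoAgent, a deep research agent powered by an innovative data synthesis pipeline and web search tools. With our method, InfoAgent achieves 15.3% accuracy on BrowseComp, 29.2% on BrowseComp-ZH, and 40.4% on Xbench-DS.
Reference translation of the paper (metadata) (Mon, 29 Sep 2025 17:59:57 GMT)
Building a Deep Research agent. Based on Qwen3 14B and using synthetic data, post-training proceeds in two stages: "In the first stage, we perform supervised finetuning (SFT) as a cold start, in order to instill long-horizon search behavior into the model." and "In the second stage, we apply RL to refine its ability of reasoning-driven tool use."
The results demonstrate the effectiveness of synthetic data and post-training, and the base model size is pleasantly modest. They suggest that developing SLMs of this kind may become popular.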

WebWeaver, WebResearcher

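Related to Tongyi DeepResearch: A New Era of Open-Source AI Researchers | Tongyi DeepResearch, papers on WebWeaver and WebResearcher have appeared. The approaches are close to one another, but the team appears to be trying a variety of them.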

WebWeaver: Structuring Web-Scale Evidence with Dynamic Outlines for Open-Ended Deep Research [73.6]
This paper tackles open-ended deep research (OEDR), a complex challenge in which AI agents must synthesize vast web-scale information into insightful reports. We introduce WebWeaver, a novel dual-agent framework that emulates the human research process.
Reference translation of the paper (metadata) (Tue, 16 Sep 2025 17:57:21 GMT)
"In this paper, we introduced WebWeaver, a novel dual-agent framework designed to overcome the fundamental flaws of static, machine-like pipelines in open-ended deep research (OEDR). By emulating the human cognitive process that integrates the planner’s dynamic research cycle with the writer’s hierarchical retrieval and writing process, WebWeaver consistently outperforms both proprietary and open-source systems, establishing a new state-of-the-art."

WebResearcher: Unleashing unbounded reasoning capability in Long-Horizon Agents [72.3]
WebResearcher is an iterative deep research paradigm that reformulates deep research as a Markov decision process. WebResearcher achieves state-of-the-art performance, surpassing frontier proprietary systems.
Reference translation of the paper (metadata) (Tue, 16 Sep 2025 17:57:17 GMT)
A framework consisting of three components: "(1) IterResearch, an iterative paradigm that reformulates deep research as a Markov Decision Process with periodic consolidation, overcoming the context suffocation and noise contamination of mono-contextual approaches; (2) WebFrontier, a scalable data synthesis engine that addresses training data scarcity through tool-augmented complexity escalation; and (3) a Research-Synthesis Framework that enables effective test-time scaling through parallel multi-agent exploration".
