March 3, 2026 – Introducing the latest arXiv papers

DREAM: Deep Research Evaluation with Agentic Metrics [21.6]
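This paper proposes DREAM (Deep Research Evaluation with Agentic Metrics). DREAM structures assessment through an evaluation protocol that combines query-agnostic metrics with adaptive metrics generated by a tool-calling agent. Controlled evaluations show that DREAM is considerably more sensitive to factual and temporal decay than existing benchmarks.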
Reference translation of the paper (metadata) (Sat, 21 Feb 2026 19:14:31 GMT)
An evaluation approach reminiscent of fact-checking: "DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes." The claim that "We demonstrate that current LLM-as-a-judge and reference-based benchmarks are often blinded by surface-level fluency and citation alignment, failing to detect deep-seated defects in factual correctness, temporal validity, and logical reasoning." sounds about right to me.

The Trinity of Consistency as a Defining Principle for General World Models [106.2]
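General world models can learn, simulate, and reason about objective physical laws. This paper proposes a theoretical framework for defining the fundamental properties that a general world model requires. The work clarifies both the limitations of current systems and the architectural requirements for future progress, establishing a principled path toward general world models.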
Reference translation of the paper (metadata) (Thu, 26 Feb 2026 16:15:55 GMT)
A survey-style paper: "This paper is organized to mirror the evolutionary path from specialized modules to unified world simulators." The authors argue that "In this paper, we propose that a World Model must be grounded in the Trinity of Consistency: Modal Consistency as the semantic interface, Spatial Consistency as the geometric basis, and Temporal Consistency as the causal engine." and release a benchmark.
The project site is The Trinity of Consistency as a Defining Principle for General World Models