Evaluation Methods – Introducing the Latest arXiv Papers

Fluid Language Model Benchmarking

Fluid Language Model Benchmarking [126.9]
We introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level. Examining four dimensions (efficiency, validity, variance, and saturation), we find that Fluid Benchmarking performs better on all of them.
Reference translation of the paper (metadata) (Sun, 14 Sep 2025 05:49:42 GMT)
A proposal for an evaluation method: "we introduce FLUID BENCHMARKING, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, FLUID BENCHMARKING is based on the insight that the relative value of benchmark items depends on an LM’s capability level, suggesting that evaluation should adapt to each LM. Methodologically, FLUID BENCHMARKING estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education."
The repository is GitHub – allenai/fluid-benchmarking: Fluid Language Model Benchmarking
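To make the adaptive selection idea concrete, here is a minimal sketch of how an item response model can drive item choice, in the spirit of computerized adaptive testing. It assumes a two-parameter logistic (2PL) model and maximum Fisher information as the selection rule, which is a common choice in adaptive testing; the quoted abstract does not spell out these details, so treat them as assumptions. The item parameters and the names `p_correct`, `fisher_information`, and `select_next_item` are hypothetical and are not taken from the allenai/fluid-benchmarking repository.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability that a model with
    ability theta answers an item with discrimination a and
    difficulty b correctly."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information that a 2PL item carries about theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, administered):
    """Return the index of the not-yet-administered item that is most
    informative at the current ability estimate theta."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


# Hypothetical item bank: (discrimination a, difficulty b) per item.
ITEMS = [(1.0, -2.0), (1.0, 0.0), (1.0, 2.0)]

# A weak model (theta = -1.5) is routed to the easy item,
# a strong model (theta = 2.1) to the hard one.
print(select_next_item(-1.5, ITEMS, set()))
print(select_next_item(2.1, ITEMS, set()))
```

With equal discrimination, the most informative item is the one whose difficulty lies closest to the current ability estimate, which is why the selected item changes with the capability of the model being evaluated.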

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games [56.8]
We present FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion. We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term memory. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 01:33:16 GMT)
A benchmark built on adventure games, together with a proposed evaluation system: "We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap." The current Success Rate is very low, but it will be interesting to see how quickly it improves.
The project site is FlashAdventure

Pitfalls in Evaluating Language Model Forecasters

Pitfalls in Evaluating Language Model Forecasters [45.4]
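We argue that, as a community, we should be cautious about conclusions drawn from evaluations of large language models as forecasters. We identify two categories of issues: (1) difficulty in trusting evaluation results due to temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting.
Reference translation of the paper (metadata) (Sat, 31 May 2025 21:49:17 GMT)
A paper summarizing pitfalls in evaluating LLMs as forecasters.
The summary reads "We identify two broad categories of issues: (1) difficulty in trusting evaluation results due to many forms of temporal leakage, and (2) difficulty in extrapolating from evaluation performance to real-world forecasting. Through systematic analysis and concrete examples from prior work, we demonstrate how evaluation flaws can raise concerns about current and future performance claims." Evaluation really is hard.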

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models

Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models [75.9]
SAGE (Sentient Agent as a Judge) is an evaluation framework for large language models. SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts. SAGE provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
Reference translation of the paper (metadata) (Thu, 01 May 2025 19:06:10 GMT)
A proposal for an evaluation framework called SAGE (Sentient Agent as a Judge): "SAGE instantiates a Sentient Agent that simulates human-like emotional changes and inner thoughts during interaction, providing a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts." The authors also report that "rankings produced by SAGE diverge markedly from Arena results, confirming that social cognition is orthogonal to generic helpfulness."
The repository is digitalhuman/SAGE at main · Tencent/digitalhuman · GitHub

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations

xVerify: Efficient Answer Verifier for Reasoning Model Evaluations [24.1]
We propose xVerify, an efficient answer verifier for reasoning model evaluations. xVerify demonstrates strong capability in equivalence judgment and can effectively determine whether answers produced by reasoning models are equivalent to reference answers. In evaluation experiments on both the test set and the generalization set, all xVerify models achieve overall F1 scores and accuracy exceeding 95%.
Reference translation of the paper (metadata) (Mon, 14 Apr 2025 17:59:36 GMT)
A proposal of the "Verify Answer for Reasoning (VAR) dataset" for LRMs and of answer verification models. The paper states that "xVerify demonstrates strong capability in equivalence judgment, enabling it to effectively determine whether the answers produced by reasoning models are equivalent to reference answers across various types of objective questions." As for performance, "xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o, while xVerify-3B-Ib surpasses GPT-4o in overall performance."
The repository is GitHub – IAAR-Shanghai/xVerify: xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
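For readers who want to see how accuracy and F1 figures of this kind are typically computed for an equivalence judge, here is a small sketch. It assumes binary verdicts (the verifier says "equivalent" or "not equivalent") compared against human gold labels, with F1 computed for the "equivalent" class; the paper may aggregate differently (for example, macro-averaged F1), so this is an illustration rather than the authors' evaluation script. The function name `accuracy_and_f1` and the sample data are hypothetical.

```python
def accuracy_and_f1(predicted, gold):
    """Compare verifier verdicts with gold labels.

    Both arguments are equal-length sequences of booleans, where True
    means "the model answer is equivalent to the reference answer".
    Returns (accuracy, F1 of the True class).
    """
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("at least one example is required")

    pairs = list(zip(predicted, gold))
    tp = sum(1 for p, g in pairs if p and g)        # true positives
    fp = sum(1 for p, g in pairs if p and not g)    # false positives
    fn = sum(1 for p, g in pairs if not p and g)    # false negatives
    correct = sum(1 for p, g in pairs if p == g)

    accuracy = correct / len(pairs)
    denominator = 2 * tp + fp + fn
    f1 = 2 * tp / denominator if denominator else 0.0
    return accuracy, f1


# Hypothetical verdicts for four answers.
verdicts = [True, True, True, False]
labels = [True, True, False, False]
print(accuracy_and_f1(verdicts, labels))
```

Reporting both numbers is useful because accuracy alone can look high when most answers fall into one class, while F1 penalizes a verifier that misses or over-accepts equivalent answers.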

Measuring AI Ability to Complete Long Tasks

Measuring AI Ability to Complete Long Tasks [6.0]
We measure the time humans typically take to complete tasks that AI models can complete with a 50% success rate. Current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes. The increase in AI models' time horizons appears to be driven by greater reliability and the ability to adapt to mistakes.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:59:31 GMT)
A proposal and examination of a metric called the "50%-task-completion time horizon", defined as "the time humans typically take to complete tasks that AI models can complete with 50% success rate". The paper reports that "On these tasks, current frontier AI models such as Claude 3.7 Sonnet have a 50% time horizon of around 50 minutes" and that "Furthermore, frontier AI time horizon has been doubling approximately every seven months since 2019, though the trend may have accelerated in 2024."
As a gauge of how large a piece of software can be generated automatically, I think this is a useful metric. How to assess "Finally, we attempt to extrapolate the trend on our tasks to one-month (167 hours) AI (Section 7.1), finding that if both the trend continues and observed performance trends generalize to real-world tasks, an 80% confidence interval for the release date of AI that can complete 1-month long software tasks spans from late 2028 to early 2031" is a hard question, but I do have a sense that automatically generating software at the level a person takes a month to develop may become possible.
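The extrapolation quoted above can be sanity-checked with back-of-the-envelope arithmetic. The sketch below assumes a constant doubling time of seven months and a starting horizon of 50 minutes, both taken from the quoted text; it yields a single point estimate and ignores the uncertainty that the paper's 80% interval captures, so it is not the authors' method. The function name `months_until_horizon` is hypothetical.

```python
import math


def months_until_horizon(current_minutes, target_minutes, doubling_months):
    """Months needed for the time horizon to grow from current_minutes
    to target_minutes if it doubles every doubling_months months."""
    doublings = math.log2(target_minutes / current_minutes)
    return doublings * doubling_months


# 50-minute horizon today, one-month (167-hour) target, 7-month doubling.
months = months_until_horizon(50, 167 * 60, 7)
print(f"{months:.1f} months, about {months / 12:.1f} years")
```

Going from 50 minutes to 167 hours is a factor of about 200, or between seven and eight doublings, which comes to roughly 53.5 months. Counted from the paper's March 2025 date, that lands in the second half of 2029, inside the quoted interval of late 2028 to early 2031.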

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue

LLM-guided Plan and Retrieval: A Strategic Alignment for Interpretable User Satisfaction Estimation in Dialogue [5.1]
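PRAISE is an interpretable framework for effective user satisfaction estimation. It operates through three modules and achieves state-of-the-art performance on three benchmarks for the user satisfaction estimation task.
Reference translation of the paper (metadata) (Thu, 06 Mar 2025 18:12:33 GMT)
A proposal of "PRAISE (Plan and Retrieval Alignment for Interpretable Satisfaction Estimation)", a framework for estimating user satisfaction. It takes an agentic approach and consists of a Strategy Planner, a Feature Retriever, and a Score Analyzer.
The results are interesting, but the LLM (API) used seems somewhat dated. I wonder what the results would look like with the latest APIs.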

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems

IntellAgent: A Multi-Agent Framework for Evaluating Conversational AI Systems [2.3]
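IntellAgent is a scalable, open-source framework for evaluating conversational AI systems. IntellAgent automates the creation of synthetic benchmarks by combining policy-driven graph modeling, realistic event generation, and interactive user-agent simulations. Our work shows that IntellAgent is an effective framework for advancing conversational AI by addressing the challenges of bridging research and deployment.
Reference translation of the paper (metadata) (Sun, 19 Jan 2025 14:58:35 GMT)
An evaluation framework for conversational AI.
The repository is GitHub – plurai-ai/intellagent: A framework for comprehensive diagnosis and evaluation of conversational agents using simulated, realistic synthetic interactions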

Benchmarking Large and Small MLLMs

Benchmarking Large and Small MLLMs [71.8]
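Large multimodal language models (MLLMs) have made remarkable progress in understanding and generating multimodal content. However, their deployment faces significant challenges, including slow inference, high computational cost, and impracticality for on-device applications. Small MLLMs, exemplified by the LLaVA series models and Phi-3-Vision, offer a promising alternative with faster inference, lower deployment costs, and the ability to handle domain-specific scenarios.
Reference translation of the paper (metadata) (Sat, 04 Jan 2025 07:44:49 GMT)
A comprehensive evaluation of MLLMs.
The paper states that "GPT-4o establishes a new standard for multimodal understanding and reasoning across diverse input types, setting a benchmark in versatility and cognitive capacity." and also that "Although LLaVA-NeXT and Phi-3-Vision excel in specialized recognition tasks, they exhibit limitations in advanced reasoning and temporal sequence processing."
This is also a study from Microsoft, and I look forward to an update with Phi-4. microsoft/phi-4 · Hugging Face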

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.9]
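Multimodal Large Language Models (MLLMs) are attracting attention from both industry and academia. In the development process, evaluation is critical because it provides intuitive feedback and guidance for improving models. This work aims to give researchers an easy grasp of how to evaluate MLLMs effectively according to different needs and to inspire better evaluation methods.
Reference translation of the paper (metadata) (Fri, 22 Nov 2024 18:59:54 GMT)
A survey on MLLM evaluation; the repository GitHub – BradyFU/Awesome-Multimodal-Large-Language-Models at Benchmarks is very comprehensive.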
