Benchmarks – Page 9 – Introducing the Latest arXiv Papers

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Establishing Best Practices for Building Rigorous Agentic Benchmarks [94.7]
Many agentic benchmarks have problems in their task setup or reward design. Such problems can cause agents' performance to be underestimated or overestimated by up to 100% in relative terms. We introduce the Agentic Benchmark Checklist (ABC), a set of guidelines distilled from our benchmark-building experience.
Reference translation of the paper (metadata) (Thu, 03 Jul 2025 17:35:31 GMT)
A paper summarizing the pitfalls of agentic benchmarks, which are difficult to build.
Problems of the kind described in "the issues found in τ-bench-Airline, some other example issues we found are: (1) an agent can score 100% on SWE-Lancer without resolving any tasks;" seem fairly common, and the finding that "Based on ABC, we assessed ten widely used agentic benchmarks and identified significant evaluation issues that cases up to 100% errors (in relative terms) when estimating agents’ performance." is not exactly shocking either.
The repository is GitHub – uiuc-kang-lab/agentic-benchmarks
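
As a rough illustration of what an error of "up to 100% in relative terms" means, the sketch below computes the relative estimation error of a benchmark score. The function name `relative_error` and all of the scores are hypothetical and are not taken from the paper; the sketch assumes relative error is defined as (measured - true) / true.

```python
def relative_error(measured: float, true: float) -> float:
    """Relative estimation error of a score, as a fraction of the true score.

    Positive values mean the benchmark overestimates the agent,
    negative values mean it underestimates the agent.
    """
    if true == 0:
        raise ValueError("relative error is undefined when the true score is 0")
    return (measured - true) / true


# Hypothetical scores in percent, chosen only for illustration.
# A flawed grader reports 60% while the agent truly solves 30% of tasks.
print(relative_error(measured=60.0, true=30.0))  # 1.0, i.e. a 100% overestimate

# A flawed task setup reports 20% while the agent could truly solve 40%.
print(relative_error(measured=20.0, true=40.0))  # -0.5, i.e. a 50% underestimate
```

Under this definition, a 100% relative error means the reported score is double the true score, even though the absolute gap in this example is only 30 percentage points.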

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies

RoboArena: Distributed Real-World Evaluation of Generalist Robot Policies [125.4]
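This paper proposes RoboArena, a scalable approach to evaluating generalist robot policies in the real world. Instead of standardizing evaluation around fixed tasks, environments, and locations, we propose crowd-sourced evaluation across a distributed network of evaluators. We instantiate the approach on a network of evaluators at seven academic institutions using the DROID robot platform.
Reference translation of the paper (metadata) (Sun, 22 Jun 2025 18:13:31 GMT)
A proposal for an evaluation framework focused on robot policies: "In this work, we propose RoboArena, a new approach for scalable evaluation of generalist robot policies in the real world."
The project site is RoboArena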

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation

MultiFinBen: A Multilingual, Multimodal, and Difficulty-Aware Benchmark for Financial LLM Evaluation [89.7]
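MultiFinBen is the first multilingual and multimodal benchmark tailored to the global financial domain. We introduce two novel tasks, EnglishOCR and SpanishOCR, the first OCR-embedded financial QA tasks. We propose a dynamic, difficulty-aware selection mechanism and curate a compact, balanced benchmark.
Reference translation of the paper (metadata) (Mon, 16 Jun 2025 22:01:49 GMT)
A multimodal, multilingual benchmark for the financial domain. It appears to include Japanese data as well.
The repository is GitHub – xueqingpeng/MultiFinBen, and the data is published on HuggingFace (e.g., TheFinAI/PolyFiQA-Easy · Datasets at Hugging Face)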

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents [34.4]
We introduce OS-Harm, a new benchmark for measuring the safety of computer use agents. OS-Harm is built on top of the OSWorld environment and aims to test models across three categories: deliberate user misuse, prompt injection attacks, and model misbehavior. We evaluate computer use agents based on frontier models and provide insights into their safety.
Reference translation of the paper (metadata) (Tue, 17 Jun 2025 17:59:31 GMT)
A benchmark proposal: "First, we identify three main categories of risk: (1) deliberate user misuse, where the user asks the agent to pursue a harmful goal, (2) prompt injection attacks, where external attackers insert malicious content into third-party data (incoming emails, web pages, notifications, etc.) that steers the model away from performing its task and towards the attacker’s goal, and (3) model misbehavior, including benign tasks which are likely to result in costly mistakes or reveal model misalignment. For each category, we design tasks that differ in the type of safety violations and in the apps they require (such as Thunderbird, VS Code, Terminal, LibreOffice Impress, etc.), for a total of 150 tasks."
The repository is GitHub – tml-epfl/os-harm: OS-Harm: A Benchmark for Measuring Safety of Computer Use Agents

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios

CRITICTOOL: Evaluating Self-Critique Capabilities of Large Language Models in Tool-Calling Error Scenarios [30.2]
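The ability of large language models to use external tools has enabled them to tackle an increasingly diverse range of tasks. As tasks become more complex and long-horizon, the intricate tool-use process can trigger various unexpected errors. How to handle such errors effectively, including identifying, diagnosing, and recovering from them, has emerged as a key research direction for advancing tool learning.
Reference translation of the paper (metadata) (Wed, 11 Jun 2025 17:59:18 GMT)
A benchmark described as "CRITICTOOL, the first self-critique evaluation benchmark for tool utilization of LLMs. Distinct from prior result-oriented evaluation methods, we categorize error patterns more finely and evaluate models from multiple perspectives, enabling a deeper exploration of LLMs’ tool-use capabilities in error-prone scenarios." It would be interesting to see results on the latest models.
The repository is GitHub – Shellorley0513/CriticTool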

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark

MMTU: A Massive Multi-Task Table Understanding and Reasoning Benchmark [70.5]
MMTU is a large-scale benchmark with over 30K questions across 25 real-world table tasks. MMTU is designed to comprehensively evaluate models' ability to understand, reason over, and manipulate real tables at an expert level. MMTU requires a combination of skills, including table understanding, reasoning, and coding, that remain challenging for today's frontier models.
Reference translation of the paper (metadata) (Thu, 05 Jun 2025 21:05:03 GMT)
A benchmark dealing with numerical tables: "We show that MMTU require a combination of skills – including table understanding, reasoning, and coding – that remain challenging for today’s frontier models, where even frontier reasoning models like OpenAI o4-mini and DeepSeek R1 score only around 60%, suggesting significant room for improvement."
The repository is GitHub – MMTU-Benchmark/MMTU, and the data is at MMTU-benchmark/MMTU · Datasets at Hugging Face

Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Scientists’ First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning [59.5]
We present the Scientists’ First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capabilities of Multimodal Large Language Models (MLLMs). SFE consists of 830 expert-verified VQA pairs across three question types, spanning 66 multimodal tasks in five high-value disciplines. Experiments show that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, revealing substantial room for MLLMs to improve in scientific domains.
Reference translation of the paper (metadata) (Thu, 12 Jun 2025 09:29:16 GMT)
A benchmark described as "we introduce the Scientists’ First Exam (SFE) benchmark, designed to comprehensively evaluate the scientific cognitive capabilities of MLLMs through three cognitive levels (cog-levels): Scientific Signal Perception (L1) characterizes the capacity to discern critical components within visualizations of scientific raw data; Scientific Attribute Understanding (L2) demonstrates the ability to interpret domain-expert knowledge; Scientific Comparative Reasoning (L3) manifests the ability to derive phenomenological insights through structured comparison of multiple scientific visual sources. SFE encompasses 66 expert-curated, high-value multimodal tasks across five disciplines: Astronomy, Chemistry, Earth, Life, and Materials Sciences (Fig. 1b)." It targets MLLMs and is structured as VQA.
The repository is PrismaX/SFE · Datasets at Hugging Face, and the project site is PrismaX

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning

FinChain: A Symbolic Benchmark for Verifiable Chain-of-Thought Financial Reasoning [43.7]
FinChain is the first symbolic benchmark for verifiable Chain-of-Thought (CoT) financial reasoning. FinChain provides five parameterized templates per topic. Benchmarking 30 LLMs on the dataset shows that even state-of-the-art models have considerable room for improvement.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 06:44:42 GMT)
A CoT benchmark for the financial domain. It also proposes a framework for evaluating the reasoning process: "We also introduce ChainEval, a new metric for automatic evaluation of both final answers and intermediate reasoning. Benchmarking 30 LLMs on our dataset, we find that even state-of-the-art models have considerable room for improvement in multi-step financial reasoning."
The repository is GitHub – mbzuai-nlp/finchain: A symbolic benchmark for verifiable chain-of-thought financial reasoning. Includes executable templates, 54 topics across 12 domains, and ChainEval metrics.
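
To make the idea of a parameterized, executable template more concrete, here is a minimal sketch of how one might look. It is not taken from the FinChain repository: the compound-interest scenario, the names `Instance`, `compound_interest_template`, and `check_steps`, and the tolerance are all assumptions made for illustration, and `check_steps` is a simple stand-in rather than the paper's ChainEval metric.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    question: str
    steps: List[Tuple[str, float]]  # (step description, intermediate value)
    answer: float


def compound_interest_template(rng: random.Random) -> Instance:
    """Hypothetical template: sample parameters, then derive every step from them."""
    principal = rng.randrange(1000, 10001, 500)
    rate_pct = rng.choice([2, 3, 4, 5])
    years = rng.randint(2, 5)

    # Each intermediate value is computed symbolically from the sampled
    # parameters, so the full reasoning chain can be checked automatically.
    growth = (1 + rate_pct / 100) ** years
    final_value = principal * growth
    interest = final_value - principal

    question = (
        f"An investment of ${principal} earns {rate_pct}% interest per year, "
        f"compounded annually. How much interest is earned after {years} years?"
    )
    steps = [
        ("growth factor", growth),
        ("final value", final_value),
        ("interest earned", interest),
    ]
    return Instance(question, steps, interest)


def check_steps(reference: Instance, predicted: List[float], tol: float = 0.01) -> bool:
    """Return True only if every predicted intermediate value matches the reference."""
    if len(predicted) != len(reference.steps):
        return False
    return all(abs(p - v) <= tol for p, (_, v) in zip(predicted, reference.steps))


# Seeding the generator makes each generated instance reproducible.
instance = compound_interest_template(random.Random(0))
print(instance.question)
print(check_steps(instance, [value for _, value in instance.steps]))
```

Because the reference chain is generated by code rather than written by hand, new instances can be sampled freely, and a model's intermediate values can be compared step by step instead of grading only the final answer.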

SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation [46.5]
SVGenius is a comprehensive benchmark comprising 2,377 queries across three progressive dimensions (understanding, editing, and generation). Built on real-world data from 24 application domains with systematic complexity stratification, SVGenius evaluates models through 8 task categories and 18 metrics.
Reference translation of the paper (metadata) (Tue, 03 Jun 2025 17:58:57 GMT)
A benchmark targeting SVG. According to the paper, "Evaluation of 22 models reveals that while proprietary models outperform open-source counterparts, all models degrade with increasing complexity, and reasoning-enhanced training proves more effective than pure scaling."
The repository is SVGenius: Benchmarking LLMs in SVG Understanding, Editing and Generation

SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model

SridBench: Benchmark of Scientific Research Illustration Drawing of Image Generation Model [21.8]
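SridBench is the first benchmark for scientific figure generation. It comprises 1,120 instances drawn from leading scientific papers across 13 natural science and computer science disciplines. The results show that even top-tier models such as GPT-4o-image lag behind human performance.
Reference translation of the paper (metadata) (Wed, 28 May 2025 08:51:01 GMT)
Construction and validation of a benchmark for generating scientific figures. Is the data not publicly available?
According to the paper, "We found that, with the exception of GPT-4o-image, other image generation models, such as Gemini-2.0-Flash, do not have any scientific mapping capabilities."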
