Benchmarks – Page 2 – Introducing the Latest arXiv Papers

MM-IFEngine: Towards Multimodal Instruction Following

MM-IFEngine: Towards Multimodal Instruction Following [85.9]
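We propose MM-IFEngine, a pipeline for generating high-quality image-instruction pairs. MM-IFInstruct-23k is suitable for SFT (Supervised Fine-Tuning) and is extended as MM-IFDPO-23k for DPO (Direct Preference Optimization). We also introduce MM-IFEval, a challenging and diverse multimodal instruction-following benchmark.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 17:59:12 GMT)
A proposal of a benchmark for "the instruction-following ability of Multimodal Large Language Models" together with models built on open models. The strength of commercial models stands out. The finding that "DPO using MM-IFDPO-23k significantly surpasses SFT on MMIFInstruct-23k" is also interesting.
The repository is GitHub – SYuan03/MM-IFEngine: MM-IFEngine: Towards Multimodal Instruction Following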

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation

CrossWordBench: Evaluating the Reasoning Capabilities of LLMs and LVLMs with Controllable Puzzle Generation [53.5]
CrossWordBench is a benchmark designed to evaluate the reasoning capabilities of Large Language Models (LLMs) and Large Vision-Language Models (LVLMs). The evaluation shows that reasoning LLMs substantially outperform non-reasoning models by effectively leveraging crossing-letter constraints. The study offers insights into the limitations of the reasoning capabilities of current LLMs and LVLMs, and provides an effective approach for creating multimodal constrained tasks for future evaluations.
Reference translation of the paper (metadata) (Sun, 30 Mar 2025 20:03:36 GMT)
A benchmark based on crossword puzzles, constructed as follows: "CrossWordBench collects data and generates puzzles from three sources: (1) multilingual word-clue pairs from public repositories, (2) dictionary-based definitions, and (3) adapted questions-answer pairs from existing benchmarks (e.g., CommonsenseQA (Talmor et al., 2018)) where the answers are open-ended or unconstrained." The reported result is "Our extensive evaluation of over 20 models shows that reasoning models substantially outperform non-reasoning counterparts and can benefit from increased crossing-letter constraints.", so LRMs are strong.
The repositories are GitHub – SeanLeng1/CrossWordBench and HINT-lab/CrossWordBench · Datasets at Hugging Face
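
To make the term "crossing-letter constraint" concrete, here is a minimal sketch in Python. It is not taken from the CrossWordBench repository: the slot representation, the function names `cells` and `crossing_conflicts`, and the toy puzzle are all assumptions made for illustration. The idea is simply that an across answer and a down answer sharing a cell must agree on the letter in that cell.

```python
from typing import Dict, List, Tuple

# Hypothetical slot format: (row, col, direction, length),
# where direction is "A" (across) or "D" (down).
Slot = Tuple[int, int, str, int]
Cell = Tuple[int, int]


def cells(slot: Slot) -> List[Cell]:
    """Return the grid coordinates covered by a slot."""
    row, col, direction, length = slot
    if direction == "A":
        return [(row, col + i) for i in range(length)]
    return [(row + i, col) for i in range(length)]


def crossing_conflicts(slots: Dict[str, Slot], answers: Dict[str, str]) -> List[Cell]:
    """List the cells where two answers place different letters."""
    grid: Dict[Cell, str] = {}
    conflicts: List[Cell] = []
    for name, slot in slots.items():
        word = answers[name].upper()
        for cell, letter in zip(cells(slot), word):
            if cell in grid and grid[cell] != letter:
                conflicts.append(cell)
            # Keep the first letter written to a cell as the reference.
            grid.setdefault(cell, letter)
    return conflicts


# Toy puzzle: 1-Across and 1-Down share the top-left cell.
toy_slots = {"1A": (0, 0, "A", 3), "1D": (0, 0, "D", 3)}
consistent = crossing_conflicts(toy_slots, {"1A": "CAT", "1D": "COW"})
inconsistent = crossing_conflicts(toy_slots, {"1A": "CAT", "1D": "DOG"})
```

In this toy setup, a solver that has committed to "CAT" for 1-Across can rule out any 1-Down candidate that does not start with "C". More crossings give more checks of this kind, which is one plausible reading of why the quoted result says models "can benefit from increased crossing-letter constraints".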

PaperBench: Evaluating AI’s Ability to Replicate AI Research

PaperBench: Evaluating AI’s Ability to Replicate AI Research [3.5]
PaperBench is a benchmark that evaluates the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch. PaperBench contains 8,316 individually gradable tasks.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 15:55:24 GMT)
OpenAI proposes "PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research."
The repository is GitHub – openai/preparedness: Releases from OpenAI Preparedness

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [26.0]
We introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains a total of 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators. We evaluate a series of state-of-the-art models on Multi-SWE-bench using three representative methods. We also launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets.
Reference translation of the paper (metadata) (Thu, 03 Apr 2025 14:06:17 GMT)
A benchmark that is multilingual in the programming-language sense: "we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++." MopenHands, essentially a modified version of OpenHands, looks like the strongest method, but it is interesting that results differ across languages.
- GitHub – All-Hands-AI/OpenHands: 🙌 OpenHands: Code Less, Make More. It is also interesting that OpenHands is building an LLM tuned for code generation, as announced in "Introducing OpenHands LM 32B — A Strong, Open Coding Agent Model".
The repository is GitHub – multi-swe-bench/multi-swe-bench: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving, and the leaderboard is Multi-SWE-bench
The paper mentions AGI in "Multi-SWE-RL is an open-source community aimed at developing high-quality RL training datasets for complex software engineering tasks. Its purpose is to serve as the foundational infrastructure for training fully autonomous agents capable of addressing real-world software engineering challenges, paving the way toward achieving AGI.", and the statement "In light of these advancements, we are firmly convinced that “scaling RL in real-world environments is the path toward human-like intelligence”." is a bold one.

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? [36.8]
CLAIMCHECK is an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. It is richly annotated by ML experts with the weaknesses stated in the reviews and the paper claims they dispute, as well as fine-grained labels for the validity, objectivity, and type of the identified weaknesses. We benchmark LLMs on three claim-centric tasks supported by CLAIMCHECK: (1) associating weaknesses with the claims they dispute, (2) predicting fine-grained labels for weaknesses and rewriting the weaknesses to enhance their specificity, and (3) verifying a paper's claims with grounded reasoning.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 17:29:45 GMT)
A benchmark proposal: "This work has introduced CLAIMCHECK—a benchmark of reviewer-identified weaknesses in NeurIPS 2023 and 2024 submissions, richly annotated with descriptive labels by experts and grounded in the claims that they dispute in the reviewed papers. Further, we benchmark various LLMs on three novel tasks enabled by CLAIMCHECK—Weakness Labeling and Editing (WLE), Claim Association (CA), and Claim Verification (CV)—all aimed at assisting reviewers during the peer review process." These tasks are difficult for current LLMs.
The repository is reportedly at https://github.com/JHU-CLSP/CLAIMCHECK

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models

MMDT: Decoding the Trustworthiness and Safety of Multimodal Foundation Models [101.7]
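Multimodal foundation models (MMFMs) play a crucial role in a variety of applications, including autonomous driving, healthcare, and virtual assistants. Existing benchmarks for multimodal models mainly assess the helpfulness of these models or focus only on limited perspectives such as fairness and privacy. We propose MMDT (Multimodal DecodingTrust), the first unified platform for comprehensively evaluating the safety and trustworthiness of MMFMs.
Reference translation of the paper (metadata) (Wed, 19 Mar 2025 01:59:44 GMT)
A proposal of a trustworthiness evaluation framework for multimodal foundation models. The main targets are safety, hallucination, fairness, privacy, adversarial robustness, and out-of-distribution (OOD) robustness. Because it covers MMFMs, both T2I and I2T are included.
The project site is MMDecodingTrust Benchmark, and a leaderboard is also available at MMDecodingTrust Benchmark. Commercial models appear to score higher than open models on average, but it is interesting that the picture varies greatly depending on the evaluation axis.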

EnvBench: A Benchmark for Automated Environment Setup

EnvBench: A Benchmark for Automated Environment Setup [76.0]
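Large language models have enabled researchers to focus on practical repository-level tasks in the software engineering domain. Existing work on environment setup introduces innovative agentic strategies, but its evaluation is often based on small datasets. To address this gap, we introduce EnvBench, a comprehensive environment setup benchmark.
Reference translation of the paper (metadata) (Tue, 18 Mar 2025 17:19:12 GMT)
A benchmark for environment setup. This is very important in practice, and in some situations it may be more welcome than code generation.
It appears to be a difficult benchmark on which scores remain low even when agents are used.
The repository is GitHub – JetBrains-Research/EnvBench: [DL4C @ ICLR 2025] A Benchmark for Automated Environment Setup, along with 🌱⚙️ EnvBench – a JetBrains-Research Collection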

BIG-Bench Extra Hard

BIG-Bench Extra Hard [98.4]
Large language models (LLMs) are increasingly deployed in everyday applications, which demands robust general reasoning capabilities. The BIG-Bench dataset has served as an important benchmark for evaluating the general reasoning capabilities of LLMs. Because state-of-the-art models now achieve near-perfect scores on many BIG-Bench tasks, its utility is diminishing. BIG-Bench Extra Hard (BBEH) is a new benchmark designed to push the boundary of LLM reasoning evaluation.
Reference translation of the paper (metadata) (Wed, 26 Feb 2025 14:50:50 GMT)
A harder version of BIG-Bench that appears to demand a broad set of reasoning skills: "Solving the tasks in BBEH requires even further reasoning skills than the problems in BBH. These skills include, but are not limited to, many-hop reasoning, learning on the fly, finding errors in reasoning traces, processing long-context inputs and finding (multi-)needles in a haystack, going against strong prior, dealing with long-range dependencies, dealing with distractors and inducing patterns from examples." It is interesting that the LRM o3-mini (high) achieves a fair score, whereas DeepSeek R1, which struggles on some tasks, scores low.
The repository is GitHub – google-deepmind/bbeh

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models

CodeCriticBench: A Holistic Code Critique Benchmark for Large Language Models [97.2]
This paper introduces CodeCriticBench, a code critique benchmark for Large Language Models (LLMs). Specifically, CodeCriticBench covers two main code tasks (code generation and code QA). In addition, the evaluation protocol includes basic critique evaluation and advanced critique evaluation for different characteristics.
Reference translation of the paper (metadata) (Sun, 23 Feb 2025 15:36:43 GMT)
A benchmark for an unusual task: "To evaluate the critique abilities of LLMs on the code domain, we introduce the first holistic code critique benchmark CodeCriticBench, which includes the critique on both code generation and code QA tasks." DeepSeek-R1 and OpenAI o1-Preview perform strongly.
The repository is GitHub – multimodal-art-projection/CodeCriticBench

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking

DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking [96.9]
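We introduce SolutionBench, a new benchmark for evaluating a system's ability to generate complete and feasible solutions to engineering problems. We also propose SolutionRAG, a new system that generates reliable solutions by leveraging tree-based exploration and a bi-point thinking mechanism.
Reference translation of the paper (metadata) (Fri, 28 Feb 2025 05:23:10 GMT)
A proposal of SolutionBench, a benchmark for generating solutions to engineering problems, and SolutionRAG, a method for solving it. Although the name contains RAG, it is an agentic approach that explores while building a tree: "SolutionRAG employs a bi-point thinking approach, alternating between solution design and review, gradually enhancing the solution’s completeness and reliability."
The repository is GitHub – Li-Z-Q/DeepSolution: DeepSolution: Boosting Complex Engineering Solution Design via Tree-based Exploration and Bi-point Thinking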
