Benchmarks – Page 13 – Introducing the Latest arXiv Papers

OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far?

OlympicArena Medal Ranks: Who Is the Most Intelligent AI So Far? [24.7]
We focus on the recently released Claude-3.5-Sonnet, Gemini-1.5-Pro, and GPT-4o. In this paper, we propose, for the first time, an approach that ranks AI models with an Olympic medal table based on their overall performance across a range of disciplines.
Reference translation of the paper (metadata) (Mon, 24 Jun 2024 16:31:12 GMT)
Benchmark results that include the latest LLMs. The paper reports that "Claude-3.5-Sonnet shows highly competitive overall performance over GPT-4o, even surpassing GPT-4o on a few subjects (i.e., Physics, Chemistry and Biology)" and that "Gemini-1.5-Pro and GPT-4V are ranked consecutively just behind GPT-4o and Claude-3.5-Sonnet, but with a clear performance gap between them." For now, GPT-4o and Claude 3.5 Sonnet look like the two front-runners.
Repository: GitHub – GAIR-NLP/OlympicArena: This is the official repository of the paper “OlympicArena: Benchmarking Multi-discipline Cognitive Reasoning for Superintelligent AI”
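
The medal-table idea can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes that the top three models in each discipline receive gold, silver, and bronze, and that models are then ordered by gold count, then silver, then bronze, as in the usual Olympic convention. The function name `medal_table`, the model names, and the scores are all hypothetical.

```python
def medal_table(scores):
    """Rank models by medals won across disciplines.

    scores: {discipline: {model: score}} (higher score is better).
    Returns a list of (model, gold, silver, bronze), best model first.
    """
    medals = {}
    for per_model in scores.values():
        # Order the models within this discipline, best score first.
        ranked = sorted(per_model, key=per_model.get, reverse=True)
        for place, model in enumerate(ranked):
            counts = medals.setdefault(model, [0, 0, 0])
            if place < 3:  # only the top three places earn a medal
                counts[place] += 1
    # Assumed ordering: gold first, then silver, then bronze, then name.
    return sorted(
        ((model, *counts) for model, counts in medals.items()),
        key=lambda row: (-row[1], -row[2], -row[3], row[0]),
    )


# Hypothetical scores, for illustration only (not results from the paper).
toy_scores = {
    "physics": {"model_a": 0.6, "model_b": 0.5, "model_c": 0.4, "model_d": 0.1},
    "chemistry": {"model_a": 0.5, "model_b": 0.7, "model_c": 0.3, "model_d": 0.2},
    "math": {"model_a": 0.9, "model_b": 0.2, "model_c": 0.5, "model_d": 0.1},
}

for row in medal_table(toy_scores):
    print(row)
```

A medal table rewards winning individual disciplines rather than a high average, so its ordering can differ from a ranking by mean score.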

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track

Ragnarök: A Reusable RAG Framework and Baselines for TREC 2024 Retrieval-Augmented Generation Track [51.3]
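It is essential to have an arena in which to build, test, visualize, and systematically evaluate RAG-based search systems. We propose the TREC 2024 RAG Track.
Reference translation of the paper (metadata) (Mon, 24 Jun 2024 17:37:52 GMT)
A benchmark and framework for RAG evaluation, with quite a striking name.
Repository: GitHub – castorini/ragnarok: Retrieval-Augmented Generation battle!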

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading

SciEx: Benchmarking Large Language Models on Scientific Exams with Human Expert Grading and Automatic Grading [100.3]
A common use of Large Language Models (LLMs) is performing tasks on scientific topics. Inspired by the way university students are evaluated on such tasks, this paper proposes SciEx. We evaluate the performance of state-of-the-art LLMs on the new benchmark.
Reference translation of the paper (metadata) (Fri, 14 Jun 2024 21:52:21 GMT)
A benchmark built from exams for university students. According to the paper, "SciEx is (1) multilingual, containing both English and German exams, and (2) multi-modal, containing questions that involve images, and (3) contains various types of freeform questions with different difficulty levels, due to the nature of university exams." Somewhat surprisingly (?), Claude Opus scores higher than GPT-4V.
Repository: tuanh23/SciEx · Datasets at Hugging Face

CS-Bench

CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery [26.4]
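We introduce CS-Bench, the first benchmark for evaluating the performance of large language models in computer science. CS-Bench consists of 5K carefully curated test samples covering 26 subfields across 4 key areas of computer science. We comprehensively evaluate more than 30 LLMs to reveal the relationship between CS performance and model scale.
Reference translation of the paper (metadata) (Wed, 12 Jun 2024 18:47:28 GMT)
A computer science benchmark with bilingual data in English and Chinese. GPT-4o scores highest in both languages, but ERNIE 4 comes close on the Chinese data, which makes the leaderboard interesting.
Repository: CS-Bench: A Comprehensive Benchmark for Large Language Models towards Computer Science Mastery (csbench.github.io)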

The BiGGen Bench

The BiGGen Bench: A Principled Benchmark for Fine-grained Evaluation of Language Models with Language Models [94.3]
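BiGGen Bench is a principled generation benchmark designed to thoroughly evaluate 9 distinct capabilities of LMs across 77 tasks. A key feature of BiGGen Bench is its use of instance-specific evaluation criteria, which closely mirror the nuanced discernment of human evaluation.
Reference translation of the paper (metadata) (Sun, 09 Jun 2024 12:30:30 GMT)
A proposed benchmark for evaluating LLMs, made up of 77 tasks in the following 9 categories.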
- Instruction Following
- Grounding
- Planning
- Refinement
- Reasoning
- Tool Usage
- Theory of Mind
- Multilingual
- Safety
Repository: prometheus-eval/BiGGen-Bench at main · prometheus-eval/prometheus-eval · GitHub. Data: prometheus-eval/BiGGen-Bench · Datasets at Hugging Face. Leaderboard: BiGGen Bench Leaderboard – a Hugging Face Space by prometheus-eval. It is interesting that the rankings change from category to category.

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos

MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos [155.5]
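MMWorld is a new benchmark for multi-discipline, multi-faceted multimodal video understanding. MMWorld consists of a human-annotated dataset that evaluates MLLMs with questions about whole videos, and a synthetic dataset that analyzes MLLMs within a single modality of perception. The evaluation covers 2 proprietary and 10 open-source MLLMs, which struggle on MMWorld.
Reference translation of the paper (metadata) (Wed, 12 Jun 2024 16:54:54 GMT)
A benchmark for evaluating MLLMs as world models (for example, whether they can simulate physical phenomena). On the leaderboard, GPT-4V ranks first and GeminiPro second. The paper also points out, however, that "Even the best performer, GPT-4V, can only achieve a 52.30% overall accuracy, and four MLLMs particularly trained on videos perform worse than random chance." Whether MLLMs or video-generation models can become world models is still debated, but the area is attracting attention.
Repository: MMWorld: Towards Multi-discipline Multi-faceted World Model Evaluation in Videos (mmworld-bench.github.io)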

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models

Alice in Wonderland: Simple Tasks Showing Complete Reasoning Breakdown in State-Of-the-Art Large Language Models [13.5]
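We demonstrate a dramatic breakdown of the function and reasoning capabilities of state-of-the-art models trained at the largest available scales. The models express strong confidence in their wrong solutions and often provide nonsensical "reasoning"-like explanations.
Reference translation of the paper (metadata) (Wed, 05 Jun 2024 23:23:54 GMT)
The paper points out that supposedly powerful LLMs fail to answer a simple problem: "Alice has N brothers and she also has M sisters. How many sisters does Alice’s brother have?" The gap from their MMLU results is large.
- There are probably various issues at play, including leakage, but this observation is interesting: "We also noticed during early experimentation that depending on choice of N and M and also the ordering of brothers and sisters in the sentence, the rate of correct responses may vary substantially."
Repository: GitHub – LAION-AI/AIW: Alice in Wonderland code base for experiments and raw experiments data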
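
For reference, the expected answer to the quoted problem is M + 1: each brother's sisters are Alice's M sisters plus Alice herself. The sketch below generates prompt variants together with their reference answers. It assumes the plain reading of the problem (all siblings share the same family), the function names are hypothetical, and it is not taken from the LAION-AI/AIW code base.

```python
def aiw_prompt(n_brothers, m_sisters):
    """Fill N and M into the problem template quoted in the paper."""
    return (
        f"Alice has {n_brothers} brothers and she also has {m_sisters} sisters. "
        "How many sisters does Alice's brother have?"
    )


def aiw_answer(n_brothers, m_sisters):
    """Reference answer under the plain reading of the problem."""
    # A brother's sisters are Alice's M sisters plus Alice herself.
    # The number of brothers does not affect the answer.
    return m_sisters + 1


# A few (N, M) variants, since the paper notes that accuracy varies with them.
for n, m in [(3, 6), (4, 2), (1, 1)]:
    print(aiw_prompt(n, m), "->", aiw_answer(n, m))
```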

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis

Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis [118.1]
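Video-MME is a full-spectrum, multi-modal evaluation benchmark of MLLMs in video analysis. We extensively evaluate a wide range of state-of-the-art MLLMs, including the GPT-4 series, Gemini 1.5 Pro, and open-source image models. Our experiments show that Gemini 1.5 Pro is the best-performing commercial model and significantly outperforms the open-source models.
Reference translation of the paper (metadata) (Fri, 31 May 2024 17:59:47 GMT)
A benchmark for video analysis. Humans annotated 2.7K QA pairs over 900 videos totaling 256 hours. The domains are also varied (GitHub – BradyFU/Video-MME: ✨✨Video-MME: The First-Ever Comprehensive Evaluation Benchmark of Multi-modal LLMs in Video Analysis).
In the current benchmark results, Gemini Pro performs best, followed by Gemini Flash, GPT-4o, and GPT-4V. Note that it is hard to put the models on an equal footing, for example because the data types each API accepts differ. The paper notes, for instance: "Since the video interface of GPT-4o has not been released yet, we sample 10 frames and evaluate the model using multiple images as input."
Repository: Video-MME: Welcome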
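
The frame-sampling workaround quoted above can be illustrated with a small helper. The paper excerpt does not say how the 10 frames are chosen, so the sketch below simply assumes evenly spaced sampling, taking the middle frame of each of 10 equal segments. The function name and defaults are hypothetical, and this is not the Video-MME evaluation code.

```python
def sample_frame_indices(total_frames, num_samples=10):
    """Return evenly spaced frame indices (0-based) for a video.

    Assumption: the video is split into num_samples equal segments and
    the middle frame of each segment is selected.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    # A short video cannot yield more distinct frames than it contains.
    num_samples = min(num_samples, total_frames)
    # Integer arithmetic avoids floating-point rounding surprises.
    return [
        ((2 * i + 1) * total_frames) // (2 * num_samples)
        for i in range(num_samples)
    ]


print(sample_frame_indices(total_frames=100))
```

With a fixed frame budget, a longer video is sampled more sparsely, which is one reason scores obtained from frame inputs and from native video inputs are hard to compare directly.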

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities

Evaluating and Modeling Social Intelligence: A Comparative Study of Human and AI Capabilities [29.2]
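This study introduces a benchmark for evaluating social intelligence, one of the most prominent aspects of human cognition. We develop a comprehensive theoretical framework for social dynamics and introduce two evaluation tasks: Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Extensive experiments and analysis show that humans outperform the latest GPT models in overall performance, zero-shot learning, one-shot generalization, and adaptability to multi-modality.
Reference translation of the paper (metadata) (Mon, 20 May 2024 07:34:48 GMT)
A benchmark for measuring social intelligence, targeting Inverse Reasoning (IR) and Inverse Inverse Planning (IIP). Even GPT-4 shows a gap from humans on some tasks. It was a little surprising to see ASI mentioned in the conclusion: "We hope that our study contributes valuable information towards the advancement of ASI."
Repository: GitHub – bigai-ai/Evaluate-n-Model-Social-Intelligence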

STAR: A Benchmark for Situated Reasoning in Real-World Videos

STAR: A Benchmark for Situated Reasoning in Real-World Videos [94.8]
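This paper proposes a new benchmark that evaluates situated reasoning ability on real-world videos through situation abstraction and logic-grounded question answering. The dataset contains four types of questions: interaction, sequence, prediction, and feasibility. We also propose a diagnostic neuro-symbolic model that can disentangle visual perception, situation abstraction, language understanding, and functional reasoning.
Reference translation of the paper (metadata) (Wed, 15 May 2024 21:53:54 GMT)
A benchmark covering interaction, sequence, prediction, and feasibility questions over videos.
Project site: STAR: A Benchmark for Situated Reasoning in Real-World Videos (bobbywu.com)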
