Benchmarks – Page 10 – Introducing the Latest arXiv Papers

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging

Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging [104.0]
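Model merging aims to combine multiple expert models into a single model, reducing storage and serving costs. Previous studies have mainly focused on merging visual classification models, or Large Language Models (LLMs) for code and math tasks. This paper introduces a model merging benchmark for MLLMs that covers multiple tasks, including VQA, Geometry, Chart, OCR, and Grounding.
Reference translation of the paper (metadata) (Mon, 26 May 2025 12:23:14 GMT)
Introduces a benchmark for multimodal model merging.
The repository is GitHub – WalkerWorldPeace/MLLMerging: Official implementation of “Unifying Multimodal Large Language Model Capabilities and Modalities via Model Merging”.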
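
To make "combining multiple expert models into a single model" concrete, here is a minimal sketch of one common merging baseline, task arithmetic, in which each expert's difference from a shared base model is scaled and added back to the base. The summary above does not say which merging methods the benchmark covers, so the choice of technique, the function name `merge_task_arithmetic`, the `scale` parameter, and the toy weights are all assumptions made for illustration.

```python
from typing import Dict, List

# A toy "state dict": parameter name -> flat list of weights (illustrative only).
StateDict = Dict[str, List[float]]


def merge_task_arithmetic(
    base: StateDict, experts: List[StateDict], scale: float = 1.0
) -> StateDict:
    """Add the scaled sum of task vectors (expert - base) to the base weights."""
    merged: StateDict = {}
    for name, base_w in base.items():
        merged_w = list(base_w)  # start from a copy of the base parameters
        for expert in experts:
            # Task vector for this expert: its weights minus the base weights.
            for i, (e, b) in enumerate(zip(expert[name], base_w)):
                merged_w[i] += scale * (e - b)
        merged[name] = merged_w
    return merged


# Hypothetical base model and two experts fine-tuned from it.
base = {"w": [1.0, 2.0]}
experts = [{"w": [2.0, 2.0]}, {"w": [1.0, 4.0]}]
merged = merge_task_arithmetic(base, experts, scale=0.5)
```

With `scale=0.5` and two experts this is the same as plain weight averaging; other values of `scale` trade off how strongly each expert's change is applied. A real MLLM merge would operate on framework tensors rather than Python lists.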

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding

PARTONOMY: Large Multimodal Models with Part-Level Visual Understanding [114.5]
We introduce PARTONOMY, an LMM benchmark designed for pixel-level part grounding. We train several part-centric LMMs and propose PLUM, a novel segmenting LMM that uses span tagging instead of segmentation tokens. Our work opens new avenues toward fine-grained, grounded visual understanding in LMMs.
Reference translation of the paper (metadata) (Tue, 27 May 2025 06:03:56 GMT)
The paper states that "Unfortunately, Large Multimodal Models (LMMs), the backbones of today’s multimodal systems, lack strong part recognition abilities", and proposes both a benchmark to verify this and an improved model, PLUM: Part-Level Understanding LMM.
The repository is GitHub – AnselBlume/partonomy: Repository for “Partonomy: Large Multimodal Models with Part-Level Visual Understanding”

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation

HumaniBench: A Human-Centric Framework for Large Multimodal Models Evaluation [38.6]
We introduce HumaniBench, a comprehensive benchmark of 32K real-world image-question pairs. HumaniBench evaluates seven Human Centered AI (HCAI) principles: fairness, ethics, understanding, reasoning, language inclusivity, empathy, and robustness.
Reference translation of the paper (metadata) (Fri, 16 May 2025 17:09:44 GMT)
A benchmark described as follows: "HumaniBench probes seven HCAI principles (fairness, ethics, understanding, reasoning, language inclusivity, empathy, robustness) through seven diverse tasks that mix open- and closed-ended visual question answering (VQA), multilingual QA, visual grounding, empathetic captioning, and robustness tests." Commercial models achieve strong results overall, but open models sometimes score higher on individual aspects.
The project site is HumaniBench: A Human-Centric Benchmark for Large Multimodal Models Evaluation

MigrationBench: Repository-Level Code Migration Benchmark from Java 8

MigrationBench: Repository-Level Code Migration Benchmark from Java 8 [18.6]
MigrationBench is a comprehensive benchmark for migrating from Java 8 to the latest long-term support (LTS) versions (Java 17, 21). It provides a comprehensive evaluation framework to facilitate rigorous and standardized assessment of large language models (LLMs) on this task. On the selected subset with Claude-3.5-Sonnet-v2, SD-Feedback achieves success rates (pass@1) of 62.33% and 27.33% for minimal and maximal migration, respectively.
Reference translation of the paper (metadata) (Mon, 19 May 2025 16:10:21 GMT)
A proposed benchmark focused on migrating code between language versions, which is an important task in practice. The authors write: "We demonstrate the feasibility of code migration from Java 8 to 17 through a deterministic workflow with SD-Feedback, and show preliminary results with promising efficacy for both minimal (62.33%) and maximal (27.33%) migration for the selected subset with Claude-3.5-Sonnet-v2."
The repository is GitHub – amazon-science/MigrationBench
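
For readers unfamiliar with the pass@1 figures quoted above: when each repository gets exactly one migration attempt, pass@1 is simply the fraction of repositories whose attempt succeeds. The sketch below shows that arithmetic only. The function name `pass_at_1` and the sample outcomes are hypothetical, and the criteria that decide whether a migration counts as successful are defined by the benchmark itself, not here.

```python
from typing import Sequence


def pass_at_1(outcomes: Sequence[bool]) -> float:
    """Fraction of repositories whose single migration attempt succeeded."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    return sum(1 for ok in outcomes if ok) / len(outcomes)


# Hypothetical outcomes for four repositories, one attempt each.
sample_outcomes = [True, False, True, True]
rate = pass_at_1(sample_outcomes)
print(f"pass@1 = {rate:.2%}")
```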

lmgame-Bench: How Good are LLMs at Playing Games? / TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games

TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games [9.2]
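This paper introduces TurnaboutLLM, a new framework and dataset for evaluating the reasoning abilities of Large Language Models (LLMs). The framework tasks LLMs with identifying contradictions between testimony and evidence within long narrative contexts. We evaluate twelve LLMs on the dataset, and the results point to the limitations of popular strategies for improving deductive reasoning.
Reference translation of the paper (metadata) (Wed, 21 May 2025 16:22:32 GMT)
A proposed benchmark that evaluates LLM performance using Ace Attorney and Danganronpa. Walkthrough sites have likely leaked into training data, but I still think it is a benchmark that tests overall ability. LRMs come out ahead (which is about what I would expect).
The repository is GitHub – zharry29/turnabout_llm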

lmgame-Bench: How Good are LLMs at Playing Games? [60.0]
This paper examines the major challenges of using popular games to evaluate modern large language model (LLM) agents. We introduce lmgame-Bench to turn games into reliable evaluations.
Reference translation of the paper (metadata) (Wed, 21 May 2025 06:02:55 GMT)
This is another game-based benchmark and evaluation. The authors write: "We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination." They also point out that leakage is a serious problem.
The repository is GitHub – lmgame-org/GamingAgent: Computer gaming agents that run on your PC and laptops. The benchmark is said to live under https://github.com/lmgame-org/GamingAgent/lmgame-bench, but that URL currently returns a 404.

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge

Multi-Domain Audio Question Answering Toward Acoustic Content Reasoning in The DCASE 2025 Challenge [102.8]
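This task defines three QA subsets for testing audio models on interactive question answering over diverse acoustic scenes. Preliminary results on the development set are compared and show strong variation across models and subsets. The challenge aims to advance the audio understanding and reasoning capabilities of audio models toward human level.
Reference translation of the paper (metadata) (Mon, 12 May 2025 09:04:16 GMT)
A description of an Audio Question Answering benchmark, the DCASE 2025 Challenge. It goes a step beyond the audio captioning task, and I think it is a task that will grow in importance.
The repository is PeacefulData/2025_DCASE_AudioQA_Official · Datasets at Hugging Face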

BAT: Benchmark for Auto-bidding Task

BAT: Benchmark for Auto-bidding Task [67.6]
This paper proposes an auction benchmark that covers the two most prevalent auction formats. We implement a series of robust baselines on a novel dataset. The benchmark provides a user-friendly and intuitive framework for researchers and practitioners to develop and refine innovative autobidding algorithms.
Reference translation of the paper (metadata) (Tue, 13 May 2025 12:12:34 GMT)
An unusual kind of benchmark: "To address this deficiency, we present an auction benchmark encompassing the two most prevalent auction formats. We implement a series of robust baselines on a novel dataset, addressing the most salient Real-Time Bidding (RTB) problem domains: budget pacing uniformity and Cost Per Click (CPC) constraint optimization."
The repository is GitHub – avito-tech/bat-autobidding-benchmark

Benchmarking LLMs’ Swarm intelligence

Benchmarking LLMs’ Swarm intelligence [50.5]
Large language models (LLMs) show potential for complex reasoning, but their capacity for emergent coordination in multi-agent systems (MAS) remains largely unexplored. Existing benchmarks often fail to fully capture the unique challenges of decentralized coordination that arise when agents operate with incomplete temporal information. We introduce SwarmBench, a new benchmark that systematically evaluates the swarm intelligence capabilities of LLMs acting as decentralized agents.
Reference translation of the paper (metadata) (Wed, 07 May 2025 12:32:01 GMT)
A proposed benchmark described as follows: "we introduce SwarmBench, a novel benchmark designed to systematically evaluate the swarm intelligence capabilities of LLMs acting as decentralized agents. SwarmBench features five foundational MAS coordination tasks (Pursuit, Synchronization, Foraging, Flocking, Transport) within a configurable 2D grid environment, forcing agents to rely primarily on local sensory input (k × k view) and local communication."
The repository is GitHub – RUC-GSAI/YuLan-SwarmIntell: 🐝 SwarmBench: Benchmarking LLMs’ Swarm Intelligence
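
The "k × k view" in the quote above means each agent sees only a small window of the grid around itself. The following is a rough sketch of how such a local observation could be cut out of a 2D grid. It is not SwarmBench's actual code: the function name `local_view`, the cell symbols, and the choice to pad out-of-bounds cells with `#` are all assumptions made for illustration.

```python
from typing import List


def local_view(
    grid: List[List[str]], row: int, col: int, k: int, pad: str = "#"
) -> List[List[str]]:
    """Return the k x k window centered on (row, col), padding cells off the grid."""
    if k < 1 or k % 2 == 0:
        raise ValueError("k must be a positive odd number so the agent is centered")
    radius = k // 2
    view: List[List[str]] = []
    for dr in range(-radius, radius + 1):
        line: List[str] = []
        for dc in range(-radius, radius + 1):
            r, c = row + dr, col + dc
            inside = 0 <= r < len(grid) and 0 <= c < len(grid[0])
            line.append(grid[r][c] if inside else pad)
        view.append(line)
    return view


# Hypothetical 3 x 3 world with agent "A" in the top-right corner.
world = [list("..A"), list("..."), list("B..")]
view = local_view(world, row=0, col=2, k=3)
for line in view:
    print("".join(line))
```

Because agent "A" sits in a corner, most of its window falls outside the grid, and agent "B" in the opposite corner is invisible to it. That limited visibility is what makes local communication necessary for coordination.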

SITE: towards Spatial Intelligence Thorough Evaluation

SITE: towards Spatial Intelligence Thorough Evaluation [121.1]
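Spatial intelligence (SI) refers to cognitive abilities that include visualizing, manipulating, and reasoning about spatial relationships. We introduce SITE, a benchmark dataset for SI Thorough Evaluation. The benchmark is built by combining a bottom-up survey of 31 existing datasets with a top-down strategy based on three classification systems from cognitive science.
Reference translation of the paper (metadata) (Thu, 08 May 2025 17:45:44 GMT)
A benchmark for Spatial Intelligence. Even GPT-4o shows a large gap from humans. (And InternVL-2.5-8B scores surprisingly high.)
The project site is SITE: towards Spatial Intelligence Thorough Evaluation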

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency

Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency [29.3]
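We introduce VCBENCH, a comprehensive benchmark for multimodal mathematical reasoning with explicit visual dependency. VCBENCH contains 1,720 problems across six cognitive domains. We evaluate 26 state-of-the-art LVLMs on VCBENCH and find large performance gaps, with even the top models failing to exceed 50% accuracy.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 03:45:30 GMT)
A proposed mathematical reasoning benchmark designed to depend on vision.
The repository is Benchmarking Multimodal Mathematical Reasoning with Explicit Visual Dependency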
