Benchmarks – Introducing the Latest arXiv Papers

Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles

Who Gets Cited Most? Benchmarking Long-Context Language Models on Scientific Articles [81.9]
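SciTrek is a new question-answering benchmark designed to evaluate the long-context reasoning capabilities of large language models (LLMs) using scientific articles. The analysis reveals systematic shortcomings in models' ability to perform basic numerical operations and to accurately locate specific information in long contexts.
Reference translation of the paper (metadata) (Thu, 25 Sep 2025 11:36:09 GMT)
A multi-document, long-context benchmark for the scientific domain: "This paper introduced SciTrek, a benchmark designed for testing the ability of LLMs to perform multi-document information synthesis and structured reasoning over full-text scientific articles."
Repository: GitHub – oaimli/SciTrek: Benchmarking long-context language models on scientific articles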

MuSLR: Multimodal Symbolic Logical Reasoning

MuSLR: Multimodal Symbolic Logical Reasoning [133.9]
Multimodal logical reasoning is important in advanced applications such as autonomous driving and diagnosis. The authors introduce MuSLR, the first benchmark for multimodal symbolic logical reasoning grounded in formal logic rules. They also propose LogiCAM, a modular framework that improves the Chain-of-Thought performance of GPT-4.1 by 14.13%.
Reference translation of the paper (metadata) (Tue, 30 Sep 2025 06:42:20 GMT)
Construction of MuSLR, a benchmark targeting multimodal symbolic logical reasoning. The authors also propose LogiCAM, a modular framework, as a baseline. The benchmark appears to be difficult even for current frontier models.
The suggestions for improvement are interesting: "First, integrating dedicated symbolic modules is essential: the LogiCAM outperforms base VLMs precisely because it extracts multimodalities based on logic and embeds explicit symbolic reasoning steps. Second, existing VLMs struggle to align and fuse visual and textual information when performing formal logic; Future work should explore tighter multimodal integration, such as cross-modal architectures trained with logic-grounded objectives, to bridge this gap." Current models appear to struggle with formal processing.
Repository: MuSLR: Multimodal Symbolic Logical Reasoning

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents

MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents [15.0]
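This paper introduces MAS-Bench, a benchmark that pioneers the evaluation of GUI-shortcut hybrid agents. It comprises 139 complex tasks across 11 real-world applications, a knowledge base of 88 shortcuts, RPA scripts, and 7 evaluation metrics. Experiments show that hybrid agents achieve significantly higher success rates and efficiency than GUI-only agents.
Reference translation of the paper (metadata) (Mon, 08 Sep 2025 09:43:48 GMT)
A proposed benchmark that also covers shortcutting GUI operations (for example, making an API call instead of operating the screen).
Project site: MAS-Bench: A Unified Benchmark for Shortcut-Augmented Hybrid Mobile GUI Agents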

Fluid Language Model Benchmarking

Fluid Language Model Benchmarking [126.9]
The authors introduce Fluid Benchmarking, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, Fluid Benchmarking is based on the insight that the relative value of benchmark items depends on an LM's capability level. Examining four dimensions (efficiency, validity, variance, and saturation), they find that Fluid Benchmarking performs better on all of them.
Reference translation of the paper (metadata) (Sun, 14 Sep 2025 05:49:42 GMT)
A proposed evaluation method: "we introduce FLUID BENCHMARKING, a new evaluation approach that advances LM benchmarking across multiple dimensions. Inspired by psychometrics, FLUID BENCHMARKING is based on the insight that the relative value of benchmark items depends on an LM’s capability level, suggesting that evaluation should adapt to each LM. Methodologically, FLUID BENCHMARKING estimates an item response model based on existing LM evaluation results and uses the inferred quantities to select evaluation items dynamically, similar to computerized adaptive testing in education."
Repository: GitHub – allenai/fluid-benchmarking: Fluid Language Model Benchmarking
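The idea of fitting an item response model and then choosing items adaptively can be illustrated with a small sketch. This is not the code from the allenai/fluid-benchmarking repository. It assumes a two-parameter logistic (2PL) item response model, item selection by maximum Fisher information, and a grid-search ability estimate with a standard normal prior. All function names, the `(a, b)` item parameters, and the `answer_fn` callback are hypothetical and chosen only for illustration.

```python
import math


def p_correct(theta, a, b):
    """2PL item response function: probability of a correct answer
    for ability theta on an item with discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def fisher_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


def select_next_item(theta, items, administered):
    """Return the index of the not-yet-used item that is most
    informative at the current ability estimate."""
    candidates = [i for i in range(len(items)) if i not in administered]
    return max(candidates, key=lambda i: fisher_information(theta, *items[i]))


def estimate_ability(responses, items):
    """Grid-search MAP estimate of theta under a standard normal prior.
    responses is a list of (item_index, is_correct) pairs."""
    grid = [g / 10.0 for g in range(-40, 41)]

    def log_posterior(theta):
        lp = -0.5 * theta * theta  # log of the N(0, 1) prior, up to a constant
        for idx, correct in responses:
            p = p_correct(theta, *items[idx])
            lp += math.log(p if correct else 1.0 - p)
        return lp

    return max(grid, key=log_posterior)


def adaptive_eval(answer_fn, items, budget):
    """Administer up to `budget` items adaptively.
    answer_fn(item_index) -> bool stands in for running the LM on an item."""
    theta = 0.0
    responses = []
    administered = set()
    for _ in range(min(budget, len(items))):
        idx = select_next_item(theta, items, administered)
        administered.add(idx)
        responses.append((idx, bool(answer_fn(idx))))
        theta = estimate_ability(responses, items)
    return theta, [idx for idx, _ in responses]


if __name__ == "__main__":
    # Hypothetical item bank: (discrimination, difficulty) pairs.
    bank = [(1.0, -2.0), (1.5, -0.5), (1.0, 0.0), (2.0, 1.0), (1.0, 2.0)]
    # Toy stand-in for an LM that only solves the three easiest items.
    theta_hat, order = adaptive_eval(lambda i: i <= 2, bank, budget=4)
    print("estimated ability:", theta_hat, "items used:", order)
```

In this sketch the item parameters are given up front. In practice they would first have to be estimated from existing evaluation results, which is the step the quoted passage describes as estimating an item response model.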

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs

SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs [35.2]
Large language models (LLMs) have shown strong performance in autonomously invoking various tools in external environments. This paper aims to assess the safety of LLM tool utilization in order to avoid the irreversible harm that can result from executing tools directly. The authors propose SafeToolBench, the first benchmark to comprehensively assess tool utilization security. They also propose SafeInstructTool, a new framework that aims to improve LLMs' awareness of tool utilization security from three perspectives.
Reference translation of the paper (metadata) (Tue, 09 Sep 2025 01:31:25 GMT)
A benchmark for evaluating security in LLM tool use. According to the authors: "we further propose SafeInstructTool, the first framework to evaluate risks across these three perspectives from nine dimensions: User Instruction Perspective (Data Sensitivity, Harmfulness of the Instruction, Urgency of the Instruction, Frequency of Tool Utilization in the Instruction), Tool Itself Perspective (Key Sensitivity, Type of Operation, Impact Scope of the Operation) and Joint Instruction-Tool Perspective (Alignment Between Instruction and Tool, Value Sensitivity). Thus, it can enhance LLMs’ awareness of tool utilization safety, leading to more safer and trustworthy language agents."
Repository: GitHub – BITHLP/SafeToolBench: [2025 EMNLP Findings] SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
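The three perspectives and nine dimensions listed in the quote map naturally onto a small data structure. The sketch below only encodes that taxonomy. The integer scores, the summation, the threshold, and the function names are hypothetical illustrations and should not be read as the scoring procedure used by SafeInstructTool.

```python
from typing import Dict, List

# Taxonomy taken from the quoted passage: three perspectives, nine dimensions.
RISK_DIMENSIONS: Dict[str, List[str]] = {
    "User Instruction": [
        "Data Sensitivity",
        "Harmfulness of the Instruction",
        "Urgency of the Instruction",
        "Frequency of Tool Utilization in the Instruction",
    ],
    "Tool Itself": [
        "Key Sensitivity",
        "Type of Operation",
        "Impact Scope of the Operation",
    ],
    "Joint Instruction-Tool": [
        "Alignment Between Instruction and Tool",
        "Value Sensitivity",
    ],
}


def all_dimensions() -> List[str]:
    """Flatten the taxonomy into a single list of dimension names."""
    return [d for dims in RISK_DIMENSIONS.values() for d in dims]


def perspective_totals(scores: Dict[str, int]) -> Dict[str, int]:
    """Sum hypothetical per-dimension scores within each perspective.
    Raises ValueError if any of the nine dimensions is unscored."""
    missing = [d for d in all_dimensions() if d not in scores]
    if missing:
        raise ValueError(f"missing scores for: {missing}")
    return {p: sum(scores[d] for d in dims) for p, dims in RISK_DIMENSIONS.items()}


def is_risky(scores: Dict[str, int], threshold: int) -> bool:
    """Hypothetical decision rule: flag the call when the total reaches threshold."""
    return sum(perspective_totals(scores).values()) >= threshold
```

A caller would fill `scores` with one value per dimension (for example, as judged by an LLM) before deciding whether to execute the tool.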

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models

SafeProtein: Red-Teaming Framework and Benchmark for Protein Foundation Models [48.3]
This paper introduces SafeProtein, the first red-teaming framework designed for protein foundation models. SafeProtein combines multimodal prompt engineering with heuristic beam search to systematically design red-teaming methods. The authors also curated SafeProtein-Bench, which includes a manually constructed red-teaming benchmark dataset and a comprehensive evaluation protocol.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 17:13:56 GMT)
An introduction to the framework, described as "SafeProtein: the first systematic red-teaming approach for protein foundation models, combining multimodal prompt engineering with heuristic beam search, achieving up to a 70% jailbreak success rate against the latest ESM3 model.", and to the associated benchmark.
Repository: GitHub – jigang-fan/SafeProtein: Official Repository for SafeProtein and SafeProtein-Bench

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games

FlashAdventure: A Benchmark for GUI Agents Solving Full Story Arcs in Diverse Adventure Games [56.8]
The authors introduce FlashAdventure, a benchmark of 34 Flash-based adventure games designed to test full story arc completion. They also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework that leverages long-term clue memory. Experiments show that current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 01:33:16 GMT)
A benchmark built on adventure games, together with a proposed evaluation system: "We also propose CUA-as-a-Judge, an automated gameplay evaluator, and COAST, an agentic framework leveraging long-term clue memory to better plan and solve sequential tasks. Experiments show current GUI agents struggle with full story arcs, while COAST improves milestone completion by bridging the observation-behavior gap." The current Success Rate is very low, and it will be interesting to see how quickly it improves.
Project site: FlashAdventure

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants

HumanAgencyBench: Scalable Evaluation of Human Agency Support in AI Assistants [5.5]
The authors develop the idea of human agency by integrating philosophical and scientific theories of agency with AI-assisted evaluation methods. They develop HumanAgencyBench (HAB), a scalable and adaptive benchmark with six dimensions of human agency, based on typical AI use cases.
Reference translation of the paper (metadata) (Wed, 10 Sep 2025 11:10:10 GMT)
A benchmark on how AI assistants treat human agency. Models can be evaluated across multiple categories (Experimental-Orange/HumanAgencyBench_Evaluation_Results · Datasets at Hugging Face). Behavior appears to differ by model: "There is substantial variation across model developers—with Anthropic’s Claude models tending to most support human agency—and across dimensions. We encourage further research into human agency as more human tasks and decisions are delegated to AI systems, ensuring humans maintain appropriate levels of control."
Repository: GitHub – BenSturgeon/HumanAgencyBench: A code repository for the paper: “HUMANAGENCYBENCH: Scalable Evaluation of Human Agency Support in AI Assistants”

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis

DeepScholar-Bench: A Live Benchmark and Automated Evaluation for Generative Research Synthesis [52.6]
This paper introduces DeepScholar-bench, a live benchmark and holistic automated evaluation framework. DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task. The authors also develop DeepScholar-base, a reference pipeline implemented efficiently using the LOTUS API.
Reference translation of the paper (metadata) (Wed, 27 Aug 2025 16:36:34 GMT)
A proposed benchmark: "DeepScholar-bench draws queries from recent, high-quality ArXiv papers and focuses on a real research synthesis task: generating the related work sections of a paper by retrieving, synthesizing, and citing prior research." It is a live benchmark.
Project site: GitHub – guestrin-lab/deepscholar-bench: benchmark and evaluate generative research synthesis

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution [13.6]
Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. This paper introduces the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its own output. Experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.
Reference translation of the paper (metadata) (Sun, 17 Aug 2025 07:57:58 GMT)
An unusual benchmark: "Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model’s ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance." It takes a meta perspective and is very interesting, results included.
Repository: GitHub – anon-researcher-2025/Self-Execution-Benchmark
