September 3, 2025 – Introducing the latest arXiv papers

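Several benchmarks for MCP have come out. All of them report that GPT-5 performs well, though I can't help suspecting that many MCP servers and the surrounding tools and libraries have been tuned for GPT-4/4.1/4.5/5 and similar models.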

LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries [38.6]
The proposed LiveMCP-101 is a benchmark of 101 carefully curated real-world queries. Experiments show that even frontier LLMs achieve a success rate below 60%. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 17:55:54 GMT)
The authors describe the benchmark as follows: "we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis."
GPT-5 performs well.

MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers [86.0]
MCP-Universe is the first comprehensive benchmark designed to evaluate LLMs on realistic and hard tasks through interaction with real-world MCP servers. The benchmark covers 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching. Even SOTA models such as GPT-5 (43.72%), Grok-4 (33.33%), and Claude-4.0-Sonnet (29.44%) show significant performance limitations.
Reference translation of the paper (metadata) (Wed, 20 Aug 2025 13:28:58 GMT)
This one is also an MCP benchmark: "we introduce MCP-Universe, the first comprehensive benchmark specifically designed to evaluate LLMs in realistic and hard tasks through interaction with real-world MCP servers. Our benchmark encompasses 6 core domains spanning 11 different MCP servers: Location Navigation, Repository Management, Financial Analysis, 3D Design, Browser Automation, and Web Searching." GPT-5 performs well here too.
The project site is MCP-Universe: Benchmarking Large Language Models with Real-World Model Context Protocol Servers, and the repository is GitHub – SalesforceAIResearch/MCP-Universe: MCP-Universe is a comprehensive framework designed for developing, testing, and benchmarking AI agents

MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers [24.7]
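MCP-Bench is a benchmark for evaluating large language models (LLMs) on realistic multi-step tasks. Built on the Model Context Protocol (MCP), it connects LLMs to 28 live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 05:58:57 GMT)
A benchmark from Accenture. In the results, GPT-5 ranks first, followed by o3, GPT-OSS 120B, Gemini 2.5 Pro, and Claude Sonnet 4.
- This is quite different from my own impression; I get the feeling that MCP servers are leaning toward GPT-family models
The repository is GitHub – Accenture/mcp-bench: MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers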

OneRec-V2 Technical Report [93.9]
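OneRec reframes recommendation as an autoregressive generation task and achieves high model FLOPs utilization. Lazy Decoder-Only Architecture: eliminates the encoder bottleneck, reducing total computation by 94% and training resources by 90%. Preference alignment with real-world user interactions: incorporates duration-aware reward shaping and adaptive ratio clipping to better match user preferences.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 15:29:51 GMT)
Recommendation using an AR (autoregressive) model.
The report says "Scaling: Although we observed a continuous decrease in loss as the model scaled from 0.1B to 8B, the downward trend does not strictly adhere to scaling laws (Kaplan et al., 2020)", but it is interesting that scaling-law-like behavior is still visible.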
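As a rough way to think about what "strictly adhering to scaling laws" would mean, here is a minimal sketch that fits the Kaplan-style power law L(N) = (N_c / N)^alpha to (model size, loss) pairs in log-log space and reports the largest deviation from the fitted line. The function names `fit_power_law` and `max_log_residual` are hypothetical helpers, and the numbers are synthetic values generated from an exact power law purely for illustration. Only the 0.1B and 8B endpoints come from the report; the intermediate sizes and all loss values are assumptions, not results from OneRec-V2.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares fit of log(loss) = intercept - alpha * log(size).

    Returns (alpha, intercept). A power law is a straight line in log-log space.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(loss) for loss in losses]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return -slope, intercept


def max_log_residual(sizes, losses, alpha, intercept):
    """Largest absolute gap (in log-loss) between the data and the fitted line."""
    return max(
        abs(math.log(loss) - (intercept - alpha * math.log(n)))
        for n, loss in zip(sizes, losses)
    )


# Synthetic illustration only: these are NOT numbers from the OneRec-V2 report.
# Sizes run from 0.1B to 8B; intermediate sizes and the exponent are assumed.
sizes = [0.1e9, 0.5e9, 1e9, 2e9, 4e9, 8e9]
losses = [3.0 * (n / 1e9) ** -0.05 for n in sizes]

alpha, intercept = fit_power_law(sizes, losses)
residual = max_log_residual(sizes, losses, alpha, intercept)
print(f"fitted alpha = {alpha:.4f}, max log-residual = {residual:.2e}")
```

With real measurements, a loss curve that keeps decreasing but bends away from the fitted line would show up as a noticeably larger residual, which is one simple reading of "decreasing, but not strictly following the scaling law".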
