February 24, 2026 – Introducing the latest arXiv papers

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation

When AI Benchmarks Plateau: A Systematic Study of Benchmark Saturation [80.7]
The authors analyze benchmark saturation across 60 Large Language Model (LLM) benchmarks selected from the technical reports of major model developers. They find that nearly half of the benchmarks are saturated, and that the saturation rate rises with benchmark age. Expert-curated benchmarks resist saturation better than crowdsourced ones.
Reference translation of the paper (metadata) (Wed, 18 Feb 2026 16:51:37 GMT)
A paper that takes stock of the sense that many benchmarks are being solved rapidly. I found this passage somewhat surprising: "Benchmarks with held-out or private test data do not exhibit systematically lower saturation than public ones. While contamination and memorization are well-documented risks (Zhou et al., 2023b; Balloccu et al., 2024; Deng et al., 2024; Sainz et al., 2024), secrecy alone does not prevent compression once distributional characteristics become widely known."
The project site is EvalEval Coalition | We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.

Data Science and Technology Towards AGI Part I: Tiered Data Management [53.6]
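The authors argue that the development of artificial intelligence has entered a new stage of data-model co-evolution. They introduce an L0-L4 tiered data management framework, ranging from raw, unprocessed resources to organized, verifiable knowledge. The effectiveness of the proposed approach is validated through empirical studies.
Reference translation of the paper (metadata) (Mon, 09 Feb 2026 18:47:51 GMT)
An analysis of the path to AGI from the data perspective. I agree with the statement "Our results suggest that effective data management should be treated as a first-class engineering problem, rather than an auxiliary preprocessing step."
The repository is UltraData – a openbmb Collection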

Mobile-Agent-v3.5: Multi-platform Fundamental GUI Agents [56.7]
This paper introduces GUI-Owl-1.5, the latest native GUI agent model. It supports a range of platforms (desktop, mobile, browser, and others) to enable cloud-edge collaboration and real-time interaction. It achieves state-of-the-art results among open-source models on more than 20 GUI benchmarks.
Reference translation of the paper (metadata) (Sun, 15 Feb 2026 01:52:19 GMT)
A GUI agent model from Alibaba. According to the paper: "Built on Qwen3-VL and powered by a scalable data pipeline and a multi-stage training paradigm, GUI-Owl1.5 comprises a family of foundation GUI models covering a full range of sizes, including instruct/thinking variants at 2B, 4B, 8B, 32B, and 235B-A22B."
The repository is GitHub – X-PLUG/MobileAgent: Mobile-Agent: The Powerful GUI Agent Family