June 3, 2025 – Introducing the latest arXiv papers

The Real Barrier to LLM Agent Usability is Agentic ROI

The Real Barrier to LLM Agent Usability is Agentic ROI [110.3]
Large language model (LLM) agents represent a promising shift in human-AI interaction. We highlight a critical usability gap in high-demand, mass-market applications.
Reference translation of the paper (metadata) (Fri, 23 May 2025 11:40:58 GMT)
The paper makes the entirely reasonable argument that "we argue that the key barrier to the practical usability of LLM agents lies not in model capability alone, but in maximizing the value an agent can provide, while minimizing the costs incurred during real-world use.", and proposes Agentic ROI as the metric for measuring it. I agree with "The massive user demand and the low Agentic ROI highlight a critical usability gap in everyday, mass-market applications."
As someone who builds a variety of these systems, I also find "In particular, the current generation of LLM agents focuses on specialized, professional tasks such as software development [97] and scientific research [24, 65], where the typical users are already domain experts and occasional errors are acceptable. As a result, these agents remain largely out of reach for the general public, who may lack the necessary expertise." to be spot on, and it stings to hear.

The Avengers: A Simple Recipe for Uniting Smaller Language Models to Challenge Proprietary Giants [66.7]
We introduce the Avengers, a simple recipe that effectively leverages the collective intelligence of smaller open-source language models. With 10 open-source models (7B parameters each), the Avengers outperforms GPT-4.1 on 10 of 15 datasets. In particular, it surpasses GPT-4.1 by 18.21% on math tasks and by 7.46% on code tasks.
Reference translation of the paper (metadata) (Mon, 26 May 2025 10:29:42 GMT)
The paper reports that ten 7B SLMs together achieve performance competitive with commercial models. "In this paper, we introduce the Avengers, a simple yet effective framework to unite multiple smaller language models (SLMs) and challenge the dominance of proprietary large models. The core of the Avengers involves straightforward embedding, clustering, scoring, and voting, without requiring neural network training, prompt engineering, or careful architecture-specific model choices."
The possibility of a leak does cross my mind, but open models have improved substantially in recent years, so I think the result is plausible.
The repository is GitHub – ZhangYiqun018/Avengers
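To make the "embedding, clustering, scoring, and voting" idea concrete, here is a minimal sketch of how such a pipeline could be wired together. It is not the authors' implementation: the function names, the toy 2-D "embeddings", the precomputed centroids, and the lambda "models" are all hypothetical stand-ins, and the real system may differ in every detail.

```python
import math
from collections import Counter

def nearest_cluster(query_vec, centroids):
    """Return the index of the centroid closest to query_vec (Euclidean distance)."""
    return min(range(len(centroids)), key=lambda i: math.dist(query_vec, centroids[i]))

def score_models(validation, centroids, models):
    """Scoring step: per-cluster accuracy of each model on labelled validation items."""
    hits = {}  # (cluster_index, model_name) -> [correct, total]
    for vec, question, gold in validation:
        c = nearest_cluster(vec, centroids)
        for name, model in models.items():
            rec = hits.setdefault((c, name), [0, 0])
            rec[0] += int(model(question) == gold)
            rec[1] += 1
    return {key: correct / total for key, (correct, total) in hits.items()}

def route_and_vote(vec, question, centroids, scores, models, top_k=1, samples=3):
    """Pick the best-scoring model(s) for the query's cluster, then majority-vote."""
    c = nearest_cluster(vec, centroids)
    ranked = sorted(models, key=lambda name: scores.get((c, name), 0.0), reverse=True)
    answers = [models[name](question) for name in ranked[:top_k] for _ in range(samples)]
    return Counter(answers).most_common(1)[0][0]

# Assumption: embeddings and cluster centroids were computed beforehand
# (e.g. by an embedding model plus k-means); toy 2-D values are used here.
centroids = [(0.0, 0.0), (10.0, 10.0)]
# Assumption: each "model" is a plain callable standing in for an SLM.
models = {
    "math_slm": lambda q: "4" if q == "2+2" else "?",
    "code_slm": lambda q: "len" if q == "list size fn" else "?",
}
validation = [((0.1, 0.2), "2+2", "4"), ((9.8, 10.1), "list size fn", "len")]

scores = score_models(validation, centroids, models)
answer = route_and_vote((0.3, -0.1), "2+2", centroids, scores, models)
print(answer)
```

The point of the sketch is only that nothing here needs training: each query is routed by nearest centroid to whichever model scored best on that cluster, and repeated samples are combined by majority vote.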

TurnaboutLLM: A Deductive Reasoning Benchmark from Detective Games [9.2]
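This paper introduces TurnaboutLLM, a new framework and dataset for evaluating the reasoning abilities of Large Language Models (LLMs). The framework tasks LLMs with identifying contradictions between testimony and evidence within long narrative contexts. We evaluate 12 LLMs on the dataset, and the results point to the limitations of common strategies for enhancing deductive reasoning.
Reference translation of the paper (metadata) (Wed, 21 May 2025 16:22:32 GMT)
A proposal for a benchmark that evaluates LLM performance using Ace Attorney and Danganronpa. Walkthrough sites and the like could well be a source of leakage, but I do think it is a benchmark that tests overall ability. LRMs come out ahead (well, as one would expect).
The repository is GitHub – zharry29/turnabout_llm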
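As a rough illustration of how a "find the contradicting testimony and evidence" task could be scored, the sketch below counts a prediction as correct when the predicted (testimony, evidence) pair is among the accepted pairs for that turn. The data layout, the function name, and the scoring rule are assumptions for illustration, not the format used in the TurnaboutLLM repository.

```python
def contradiction_accuracy(predictions, gold_pairs):
    """Fraction of turns where the predicted (testimony_idx, evidence_idx)
    pair appears in that turn's set of accepted contradicting pairs."""
    if not gold_pairs:
        raise ValueError("gold_pairs must not be empty")
    correct = sum(1 for pred, gold in zip(predictions, gold_pairs) if pred in gold)
    return correct / len(gold_pairs)

# Hypothetical data: one predicted pair per turn, one set of valid pairs per turn.
predictions = [(2, 5), (0, 1), (3, 3)]
gold_pairs = [{(2, 5)}, {(0, 2), (1, 2)}, {(3, 3), (3, 4)}]

accuracy = contradiction_accuracy(predictions, gold_pairs)
print(f"{accuracy:.3f}")
```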

lmgame-Bench: How Good are LLMs at Playing Games? [60.0]
This paper examines the major challenges in using popular games to evaluate modern large language model (LLM) agents. We introduce lmgame-Bench to turn games into reliable evaluations.
Reference translation of the paper (metadata) (Wed, 21 May 2025 06:02:55 GMT)
This is another game-based benchmark and evaluation. The authors write "We study the major challenges in using popular video games to evaluate modern LLMs and find that directly dropping LLMs into games cannot make an effective evaluation, for three reasons: brittle vision perception, prompt sensitivity, and potential data contamination.", pointing out that leakage is a significant issue here as well.
The repository is GitHub – lmgame-org/GamingAgent: Computer gaming agents that run on your PC and laptops. The benchmark is said to live under https://github.com/lmgame-org/GamingAgent/lmgame-bench, but that URL currently returns a 404.