August 21, 2025 – Introducing the latest arXiv papers

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding

UI-AGILE: Advancing GUI Agents with Effective Reinforcement Learning and Precise Inference-Time Grounding [16.9]
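We introduce UI-AGILE, which enhances GUI agents at both training and inference time. For training, we propose a suite of improvements to the supervised fine-tuning (SFT) process. For inference, we propose decomposed grounding with selection, which dramatically improves grounding accuracy on high-resolution displays.
Reference translation of the paper (metadata) (Sat, 09 Aug 2025 17:51:27 GMT)
A proposed framework for strengthening grounding ability, which has a large effect on GUI agent performance. According to the paper, "UI-AGILE enhances GUI agents through improved training with a Continuous Reward function, Simple Thinking reward, and Cropping-based Resampling, and inference with Decomposed Grounding with Selection."
Repository: KDEGroup/UI-AGILE on GitHub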
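
To make the idea of decomposing a high-resolution screenshot more concrete, here is a minimal sketch. It is not the UI-AGILE implementation: the tiling scheme, the overlap, the helper names (`split_into_tiles`, `to_global`, `select_best`) and the per-candidate score are all assumptions made for illustration. It only shows how a screen could be cut into overlapping tiles, how a tile-local click point maps back to full-screen coordinates, and how one candidate could be selected.

```python
from typing import List, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in full-image pixels


def split_into_tiles(width: int, height: int, tile: int, overlap: int) -> List[Box]:
    """Cover a width x height screenshot with overlapping tiles (hypothetical scheme)."""
    if overlap >= tile:
        raise ValueError("overlap must be smaller than tile")
    step = tile - overlap
    boxes = []
    for top in range(0, max(height - overlap, 1), step):
        for left in range(0, max(width - overlap, 1), step):
            # Clip tiles at the right and bottom edges of the screenshot.
            boxes.append((left, top, min(left + tile, width), min(top + tile, height)))
    return boxes


def to_global(box: Box, local_x: float, local_y: float) -> Tuple[float, float]:
    """Map a point predicted inside one tile back to full-image coordinates."""
    left, top, _, _ = box
    return (left + local_x, top + local_y)


def select_best(candidates: List[Tuple[Box, Tuple[float, float], float]]) -> Tuple[float, float]:
    """Pick the candidate with the highest score (the score source is an assumption)."""
    box, (local_x, local_y), _ = max(candidates, key=lambda c: c[2])
    return to_global(box, local_x, local_y)


tiles = split_into_tiles(3840, 2160, tile=1280, overlap=128)
point = select_best([
    (tiles[0], (10.0, 20.0), 0.2),
    (tiles[-1], (100.0, 50.0), 0.9),
])
```

In a real system the per-tile predictions and the selection scores would come from the model; here they are hard-coded placeholders.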

DocR1: Evidence Page-Guided GRPO for Multi-Page Document Understanding [97.4]
This paper introduces DocR1, an MLLM trained with Evidence Page-Guided GRPO (EviGRPO), a new RL framework. EviGRPO incorporates an evidence-aware reward mechanism that encourages a coarse-to-fine reasoning strategy. We show that DocR1 achieves state-of-the-art performance on multi-page tasks while maintaining strong results on single-page benchmarks.
Reference translation of the paper (metadata) (Sun, 10 Aug 2025 12:03:45 GMT)
A proposed framework for reading comprehension over documents with many pages.
The paper says, "When engaging in multi-page reading comprehension, humans typically begin by identifying the pages likely to contain the answer, and then focus on locating the specific regions that correspond to the question and answer within those pages. Inspired by this “coarse-to-fine” reading strategy, EviGRPO mimics the human approach by first selecting a small set of potentially relevant pages at a coarse level, followed by fine-grained reasoning over the selected content." I do wonder, though, whether this kind of domain-specific (task-specific) approach is still effective.
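
The coarse-to-fine reading pattern quoted above can be illustrated with a toy sketch. This is not DocR1 or EviGRPO: the word-overlap scoring below is a stand-in for a trained MLLM, and the function names (`select_pages`, `best_sentence`) and sample pages are hypothetical. It only shows the two-stage shape: first narrow down to a few pages, then look for the answer only inside those pages.

```python
from typing import List, Tuple


def _overlap(question: str, text: str) -> int:
    """Count distinct lowercase words shared by the question and the text."""
    return len(set(question.lower().split()) & set(text.lower().split()))


def select_pages(question: str, pages: List[str], k: int = 2) -> List[int]:
    """Coarse step: indices of the k pages with the most word overlap."""
    ranked = sorted(range(len(pages)), key=lambda i: _overlap(question, pages[i]), reverse=True)
    return sorted(ranked[:k])


def best_sentence(question: str, pages: List[str], page_ids: List[int]) -> Tuple[int, str]:
    """Fine step: search sentences only within the selected pages."""
    candidates = [
        (i, s.strip())
        for i in page_ids
        for s in pages[i].split(".")
        if s.strip()
    ]
    return max(candidates, key=lambda c: _overlap(question, c[1]))


pages = [
    "Annual report overview and contents",
    "Revenue in 2024 was 12 million dollars. Costs fell.",
    "Board members and governance",
]
question = "What was revenue in 2024"
evidence_pages = select_pages(question, pages, k=1)
answer_location = best_sentence(question, pages, evidence_pages)
```

The point of the two stages is that the fine step never sees pages that were filtered out, which keeps the expensive reasoning focused on a small context.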

LLMEval-3: A Large-Scale Longitudinal Study on Robust and Fair Evaluation of Large Language Models [51.6]
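Existing evaluations of large language models (LLMs) on static benchmarks are vulnerable to data contamination and leaderboard overfitting. We introduce LLMEval-3, a framework for dynamic evaluation of LLMs. LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run.
Reference translation of the paper (metadata) (Thu, 07 Aug 2025 14:46:30 GMT)
A benchmark described as follows: "LLMEval-3 is built on a proprietary bank of 220k graduate-level questions, from which it dynamically samples unseen test sets for each evaluation run." As has been pointed out before, public benchmarks are strongly affected by leaks, and this paper makes the same point.
Repository: llmeval/LLMEval-3: 中文大语言模型评测第三期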
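
As a rough illustration of sampling a fresh test set from a private question bank on every run, consider the sketch below. It is not the LLMEval-3 code: the `QuestionBank` class, its bookkeeping of already-served question IDs, and the seed parameter are assumptions made for this example, and "unseen" is interpreted here simply as "not served in an earlier run".

```python
import random
from typing import List, Set


class QuestionBank:
    """Hypothetical question bank that never serves the same question twice."""

    def __init__(self, question_ids: List[int], seed: int = 0) -> None:
        self._ids = list(question_ids)
        self._served: Set[int] = set()
        self._rng = random.Random(seed)

    def sample_unseen(self, n: int) -> List[int]:
        """Draw n question IDs that no earlier evaluation run has received."""
        unseen = [q for q in self._ids if q not in self._served]
        if n > len(unseen):
            raise ValueError("not enough unseen questions left")
        batch = self._rng.sample(unseen, n)
        self._served.update(batch)
        return batch


bank = QuestionBank(list(range(100)))
run_a = bank.sample_unseen(10)
run_b = bank.sample_unseen(10)
```

Because each run draws from the remaining pool, a model cannot benefit from having memorized the questions used in a previous run, as long as the bank itself stays private.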