April 28, 2025 – Introducing the Latest arXiv Papers

UFO2: The Desktop AgentOS, UI-TARS-1.5

UFO2: The Desktop AgentOS [60.3]
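UFO2 is a multi-agent AgentOS for Windows desktops that develops computer-use agents (CUAs) into practical, system-level automation. We evaluate UFO2 on more than 20 real-world Windows applications and show substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration opens a scalable path toward reliable, user-oriented desktop automation.
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 13:04:43 GMT)
This is version 2 of UFO, covered earlier in "OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused) – Introducing the Latest arXiv Papers". As with AgentS, the steady stream of new versions shows how much momentum this field has. ByteDance has also released UI-TARS-1.5, the successor to the work covered in "UI-TARS: Pioneering Automated GUI Interaction with Native Agents – Introducing the Latest arXiv Papers" (UI-TARS: Next-generation native GUI agent model designed to interact seamlessly with GUIs using human-like perception; see below).
The repository is GitHub – microsoft/UFO: The Desktop AgentOS.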

Introducing UI-TARS-1.5
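UI-TARS-1.5 is an open-source multimodal agent built on a powerful vision-language model. It integrates advanced reasoning enabled by reinforcement learning and achieves state-of-the-art results across a range of standard benchmarks.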

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.8]
This paper introduces the Judge Evaluation for Test-Time Scaling (JETTS) benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. The benchmark shows that judges are competitive with outcome reward models in reranking, but consistently worse than process reward models in beam search.
Reference translation of the paper (metadata) (Mon, 21 Apr 2025 17:33:23 GMT)
Motivated by "we seek to understand the feasibility of using LLM-judges in place of typically used RMs in test-time compute procedures.", the paper proposes a benchmark: "we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement." The results are fairly harsh: "We find that weak judges can help strong generators in easier tasks, such as instruction following, but not in reasoning-intensive tasks like coding or math. Larger judges bring the most benefit for math and instruction following tasks, but no evaluated judges are able to reliably improve generator performance for coding. Lastly, while natural language critiques are touted as a defining advantage of judges over RMs, we find that such critiques have significant room for improvement in terms of utility."
The repository is GitHub – SalesforceAIResearch/jetts-benchmark: Code repository for the paper “Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators”
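To make the "response reranking" setting concrete, here is a minimal best-of-N sketch in Python. It is not the JETTS code: `rerank_best_of_n` and `toy_judge` are hypothetical names, and the toy judge is a stand-in for a real LLM judge that would score each candidate response to a prompt.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate the judge scores highest (first one wins ties)."""
    if not candidates:
        raise ValueError("need at least one candidate response")
    # max() returns the first maximal element, so ties keep generation order.
    return max(candidates, key=lambda response: judge_score(prompt, response))


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an LLM judge: only the answer "4" scores 1.0."""
    return 1.0 if response.strip() == "4" else 0.0


if __name__ == "__main__":
    # Three sampled answers from an imagined generator; the judge picks one.
    best = rerank_best_of_n("What is 2 + 2?", ["5", "4", "22"], toy_judge)
    print(best)
```

In this setting the generator stays fixed and only the selection step changes, which is why a weak judge can still hurt: if its scores are noisy, the chosen response may be worse than simply taking the first sample.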

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks [37.8]
This paper examines more than 2,000 multilingual (non-English) benchmarks from 148 countries. English is significantly overrepresented in these benchmarks. Most benchmarks rely on original-language content rather than translations.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 01:47:37 GMT)
A survey of multilingual benchmarks. The finding that "Importantly, simply translating English benchmarks proves insufficient for robust evaluation, localized benchmarks (like CMMLU for Chinese) show substantially higher correlation with human judgments (0.68) than translated equivalents (0.47 and 0.49), highlighting the critical need for culturally and linguistically authentic evaluation resources." is what one would expect, but seeing it backed by numbers makes it more convincing.
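As a rough illustration of what "correlation with human judgments" measures, the sketch below computes a correlation between per-model benchmark scores and per-model human ratings. The quoted passage does not say which coefficient the paper uses, so Pearson is an assumption here, and every number in the example is made up for illustration rather than taken from the paper.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 items")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical scores for four imaginary models (not data from the paper).
    benchmark_scores = [41.0, 55.0, 62.0, 70.0]
    human_ratings = [3.1, 3.0, 4.2, 4.5]
    print(round(pearson(benchmark_scores, human_ratings), 2))
```

A benchmark whose scores rank models the way human raters do yields a value near 1, which is the sense in which the localized benchmarks in the quote outperform the translated ones.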
