Introducing the Latest arXiv Papers

UFO2: The Desktop AgentOS, UI-TARS-1.5

UFO2: The Desktop AgentOS [60.3]
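UFO2 is a multi-agent AgentOS for Windows desktops that advances computer-use agents (CUAs) toward practical, system-level automation. We evaluate UFO2 on more than 20 real-world Windows applications and show substantial improvements in robustness and execution accuracy over prior CUAs. Our results show that deep OS integration opens a scalable path to reliable, user-oriented desktop automation.
Reference translation of the paper (metadata) (Sun, 20 Apr 2025 13:04:43 GMT)
This is version 2 of the UFO covered in the earlier post "OS-COPILOT/FRIDAY (Fully Responsive Intelligence, Devoted to Assisting You) and UFO (UI-Focused)". As with Agent S, the steady stream of new versions shows how active this field is. ByteDance has also released UI-TARS-1.5, the successor to the model covered in the earlier post "UI-TARS: Pioneering Automated GUI Interaction with Native Agents" (UI-TARS: Next-generation native GUI agent model designed to interact seamlessly with GUIs using human-like perception; see below).
The repository is GitHub – microsoft/UFO: The Desktop AgentOS.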

Introducing UI-TARS-1.5
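UI-TARS-1.5 is an open-source multimodal agent built on a powerful vision-language model. It integrates advanced reasoning enabled by reinforcement learning and achieves state-of-the-art results on a range of standard benchmarks.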

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators

Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators [66.8]
This paper introduces the Judge Evaluation for Test-Time Scaling (JETTS) benchmark. It evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings. Our benchmark shows that judges are competitive with outcome reward models in reranking but consistently worse than process reward models in beam search.
Reference translation of the paper (metadata) (Mon, 21 Apr 2025 17:33:23 GMT)
The motivation is "we seek to understand the feasibility of using LLM-judges in place of typically used RMs in test-time compute procedures." The paper proposes a benchmark to that end: "we introduce the Judge Evaluation for Test-Time Scaling (JETTS) benchmark, which evaluates judge performance in three domains (math reasoning, code generation, and instruction following) under three task settings: response reranking, step-level beam search, and critique-based response refinement." The results are rather sobering: "We find that weak judges can help strong generators in easier tasks, such as instruction following, but not in reasoning-intensive tasks like coding or math. Larger judges bring the most benefit for math and instruction following tasks, but no evaluated judges are able to reliably improve generator performance for coding. Lastly, while natural language critiques are touted as a defining advantage of judges over RMs, we find that such critiques have significant room for improvement in terms of utility."
The repository is GitHub – SalesforceAIResearch/jetts-benchmark: Code repository for the paper “Evaluating Judges as Evaluators: The JETTS Benchmark of LLM-as-Judges as Test-Time Scaling Evaluators”
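To make the first of the three task settings concrete, here is a minimal sketch of response reranking (best-of-N selection). It is not taken from the JETTS repository: the function names `rerank_best_of_n` and `toy_judge` are hypothetical, and the keyword-based scorer only stands in for a real LLM judge or reward model.

```python
from typing import Callable, Sequence


def rerank_best_of_n(
    prompt: str,
    candidates: Sequence[str],
    judge_score: Callable[[str, str], float],
) -> str:
    """Return the candidate response that the judge scores highest.

    `judge_score` is an assumed interface: it takes (prompt, response)
    and returns a number where higher means better.
    """
    if not candidates:
        raise ValueError("need at least one candidate response")
    scores = [judge_score(prompt, c) for c in candidates]
    # max() keeps the first candidate when several share the top score.
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index]


def toy_judge(prompt: str, response: str) -> float:
    """Hypothetical stand-in for an LLM judge: rewards responses containing "4"."""
    return 1.0 if "4" in response else 0.0


if __name__ == "__main__":
    print(rerank_best_of_n("What is 2 + 2?", ["5", "4", "22"], toy_judge))
```

In a real test-time scaling setup the generator would sample the N candidates and the scorer would be a judge model or a reward model; swapping only `judge_score` is what allows the two to be compared under the same procedure.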

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks

The Bitter Lesson Learned from 2,000+ Multilingual Benchmarks [37.8]
This paper examines more than 2,000 multilingual (non-English) benchmarks from 148 countries. English is significantly overrepresented in these benchmarks. Most benchmarks rely on original-language content rather than translations.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 01:47:37 GMT)
A survey report on multilingual benchmarks. The finding that "Importantly, simply translating English benchmarks proves insufficient for robust evaluation, localized benchmarks (like CMMLU for Chinese) show substantially higher correlation with human judgments (0.68) than translated equivalents (0.47 and 0.49), highlighting the critical need for culturally and linguistically authentic evaluation resources." is what one would expect, but seeing it backed by numbers makes it more convincing.

Seedream 3.0 Technical Report

Seedream 3.0 Technical Report [62.9]
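Seedream 3.0 is a high-performance Chinese-English bilingual image generation foundation model. We develop several technical improvements to address existing challenges in Seedream 2.0. Seedream 3.0 provides native high-resolution output (up to 2K) and generates images with high visual quality.
Reference translation of the paper (metadata) (Wed, 16 Apr 2025 16:23:31 GMT)
A multilingual image generation model from ByteDance; the sample images show that it is a very strong model. It claims SoTA on Text to Image Model Arena | Artificial Analysis (though it may since have been overtaken by GPT-4o?).
The project site is Doubao Team.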

Foundation Models for Environmental Science: A Survey of Emerging Frontiers

Foundation Models for Environmental Science: A Survey of Emerging Frontiers [27.8]
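This survey provides an overview of foundation model applications in environmental science. It highlights advances in common environmental use cases, including forward prediction, data generation, data assimilation, downscaling, inverse modeling, model ensembling, and decision-making across domains. We aim to foster interdisciplinary collaboration that accelerates advances in machine learning and drives discovery on critical environmental problems.
Reference translation of the paper (metadata) (Sat, 05 Apr 2025 20:56:38 GMT)
A survey whose scope is stated as "This survey presents a comprehensive overview of foundation model applications in environmental science, highlighting advancements in common environmental use cases including forward prediction, data generation, data assimilation, downscaling, inverse modeling, model ensembling, and decision-making across domains."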

MM-IFEngine: Towards Multimodal Instruction Following

MM-IFEngine: Towards Multimodal Instruction Following [85.9]
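We propose MM-IFEngine, a pipeline for generating high-quality image-instruction pairs. The resulting MM-IFInstruct-23k dataset is suited to Supervised Fine-Tuning (SFT) and is extended as MM-IFDPO-23k for Direct Preference Optimization (DPO). We also introduce MM-IFEval, a challenging and diverse multimodal instruction-following benchmark.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 17:59:12 GMT)
A proposal of a benchmark for "the instruction-following ability of Multimodal Large Language Models" and of models (built on open models). The strength of commercial models stands out. The finding that "DPO using MM-IFDPO-23k significantly surpasses SFT on MM-IFInstruct-23k" is also interesting.
The repository is GitHub – SYuan03/MM-IFEngine: MM-IFEngine: Towards Multimodal Instruction Following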
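For reference, the SFT-versus-DPO comparison can be read against the standard DPO objective. The formula below is the generic one from the original DPO formulation, not something quoted from the MM-IFEngine paper; treating $y_w$ and $y_l$ as the preferred and rejected responses of an MM-IFDPO-23k pair is an assumption made for illustration.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Here $x$ is the input (image plus instruction), $\pi_{\mathrm{ref}}$ is the frozen reference policy, $\sigma$ is the logistic function, and $\beta$ controls how far the tuned policy may drift from the reference. SFT, by contrast, only maximizes the likelihood of a single target response and never sees a rejected one.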

Exploring Expert Failures Improves LLM Agent Tuning

Exploring Expert Failures Improves LLM Agent Tuning [76.3]
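This paper proposes Exploring Expert Failures (EEF), which identifies beneficial actions from failed expert trajectories. EEF successfully solves some previously unsolved subtasks and improves agent tuning performance.
Reference translation of the paper (metadata) (Thu, 17 Apr 2025 17:53:54 GMT)
"In this paper, we present EEF, a novel framework that learns beneficial actions from negative expert data while remaining robust against noise from suboptimal actions." The paper claims SoTA on the WebShop and SciWorld benchmarks.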

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation

LANGTRAJ: Diffusion Model and Dataset for Language-Conditioned Trajectory Simulation [94.8]
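LangTraj is a language-conditioned scene-diffusion model that simulates the joint behavior of all agents in traffic scenarios. By conditioning on natural language inputs, LangTraj provides flexible and intuitive control over interactive behaviors. LangTraj demonstrates strong performance in realism, language controllability, and language-conditioned safety-critical simulation.
Reference translation of the paper (metadata) (Tue, 15 Apr 2025 17:14:06 GMT)
A proposal of a trajectory generation method: "LANGTRAJ advances autonomous vehicle simulation by leveraging language-conditioned diffusion models to generate diverse, behaviorally rich scenarios."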

UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents

UXAgent: A System for Simulating Usability Testing of Web Design with LLM Agents [33.9]
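Recent advances in research on large language model-simulated agents (LLM Agents) inspired us to design UXAgent. The system comprises a Persona Generator module, an LLM Agent module, and a Universal Browser Connector module, and automatically generates thousands of simulated users.
Reference translation of the paper (metadata) (Sun, 13 Apr 2025 02:34:22 GMT)
A proposal of a framework: "In this work, we designed UXAgent, a system enabling researchers to conduct simulated user studies, thereby facilitating iterative refinement of their UX study designs."
Being able to use a wide variety of personas is an advantage, but how valid are the simulated results?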

Future-Proof Yourself: An AI Era Survival Guide

Future-Proof Yourself: An AI Era Survival Guide [2.7]
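Future-Proof Yourself is a practical guide that helps readers navigate the fast-changing world of artificial intelligence. The book begins by explaining, in simple and relatable terms, how computers learn from data. It shows how the basic ideas of machine learning evolve into advanced systems that can recognize images, understand language, and even make decisions.
Reference translation of the paper (metadata) (Sun, 06 Apr 2025 06:11:29 GMT)
A survival guide (?) of textbook length.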
Home | MIMIC
