February 2026 – Page 3 – Introducing the Latest arXiv Papers

CUA-Skill: Develop Skills for Computer Using Agent

CUA-Skill: Develop Skills for Computer Using Agent [48.9]
We introduce CUA-Skill, a skill base for computer-using agents that encodes human computer-use knowledge as skills. We build CUA-Skill Agent, an end-to-end computer-using agent that supports dynamic skill retrieval, argument instantiation, and memory-aware failure recovery. The results show that CUA-Skill substantially improves execution success rate and robustness on end-to-end benchmarks.
Reference translation of the paper (metadata) (Mon, 02 Feb 2026 23:11:55 GMT)
"How can we build a scalable and transferable skill base for desktop environments that captures human procedural knowledge and enables reliable and capable CUAs? In this work, we answer this question by introducing CUA-Skill, the first systematic agentic skill library designed for desktop computer use." A CUA built on Skills; it looks quite effective.
Repository: CUA-Skill

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models

Outcome Accuracy is Not Enough: Aligning the Reasoning Process of Reward Models [108.3]
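Rationale Consistency is a fine-grained metric that quantifies the alignment between a model's reasoning process and human judgment. Our evaluation of frontier models shows that rationale consistency effectively discriminates among state-of-the-art models. We introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training.
Reference translation of the paper (metadata) (Wed, 04 Feb 2026 15:24:52 GMT)
"Our evaluation of frontier models reveals that rationale consistency effectively discriminates among state-of-the-art models and detects deceptive alignment, while outcome accuracy falls short in both respects. To mitigate this gap, we introduce a hybrid signal that combines rationale consistency with outcome accuracy for GenRM training." The first point in the quote seems intuitively right, but it is still interesting to see it demonstrated.
Repository: GitHub – QwenLM/RationaleRM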
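As a rough illustration of what a "hybrid signal" could look like, here is a minimal sketch. The weighted-sum form, the `alpha` weight, and the function name `hybrid_reward` are assumptions made for this example; the summary above does not state how the paper actually combines the two quantities.

```python
def hybrid_reward(outcome_correct: bool,
                  rationale_consistency: float,
                  alpha: float = 0.5) -> float:
    """Blend outcome accuracy with rationale consistency.

    Hypothetical weighted sum; `alpha` is an assumed mixing weight,
    not a value taken from the paper.
    """
    if not 0.0 <= rationale_consistency <= 1.0:
        raise ValueError("rationale_consistency must be in [0, 1]")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    # Outcome accuracy for a single judgment is binary: right or wrong.
    outcome = 1.0 if outcome_correct else 0.0
    return alpha * outcome + (1.0 - alpha) * rationale_consistency
```

Under this assumed form, a reward model that reaches the right verdict with reasoning that does not match human rationales earns only part of the reward, which is the gap the quoted passage points at.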

POINTS-GUI-G: GUI-Grounding Journey

POINTS-GUI-G: GUI-Grounding Journey [22.4]
POINTS-GUI-G-8B achieves state-of-the-art performance with scores of 59.9 on ScreenSpot-Pro, 66.0 on OSWorld-G, 95.7 on ScreenSpot-v2, and 49.9 on UI-Vision. The model's success is driven by three factors: (1) refined data engineering, (2) improved training strategies, and (3) reinforcement learning with verifiable rewards.
Reference translation of the paper (metadata) (Fri, 06 Feb 2026 05:14:11 GMT)
A proposal for a small model that performs well on GUI grounding. "(1) Refined Data Engineering, involving the unification of diverse open-source datasets format alongside sophisticated strategies for augmentation, filtering, and difficulty grading; (2) Improved Training Strategies, including continuous fine-tuning of the vision encoder to enhance perceptual accuracy and maintaining resolution consistency between training and inference; and (3) Reinforcement Learning (RL) with Verifiable Rewards." The construction process is also instructive.
Repository: GitHub – Tencent/POINTS-GUI
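For readers unfamiliar with "verifiable rewards" in GUI grounding, the sketch below shows one common form: a binary check of whether the predicted click point lands inside the target element's bounding box. The function name `grounding_reward` and the box convention `(left, top, right, bottom)` are assumptions for illustration; the summary does not specify the exact reward POINTS-GUI-G uses.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom), assumed convention


def grounding_reward(point: Point, box: Box) -> float:
    """Return 1.0 if the predicted point falls inside the target box, else 0.0.

    Hypothetical reward for illustration; it can be checked automatically
    against ground-truth annotations, which is what makes it "verifiable".
    """
    x, y = point
    left, top, right, bottom = box
    inside = left <= x <= right and top <= y <= bottom
    return 1.0 if inside else 0.0
```

For example, `grounding_reward((120, 45), (100, 30, 200, 60))` rewards a click that hits the element, while a click at `(10, 10)` on the same target earns nothing.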

UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents

UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents [50.1]
Online reinforcement learning (RL) offers a promising paradigm for enhancing GUI agents through direct environment interaction. We propose UI-Mem, a novel framework that enhances GUI online RL with a hierarchical experience memory. UI-Mem significantly outperforms traditional RL baselines and static reuse strategies.
Reference translation of the paper (metadata) (Thu, 05 Feb 2026 16:21:43 GMT)
"constructs a hierarchical, self-evolving memory that decomposes raw experiences into reusable workflows, subtask skills, and failure patterns. We utilized this memory through a stratified group sampling mechanism tailored for GRPO, which balances memory-guided exploitation with necessary exploration to facilitate effective advantage estimation." A proposal for a memory mechanism for GUI agents.
Repository: UI-Mem: Self-Evolving Experience Memory for Online Reinforcement Learning in Mobile GUI Agents
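To make the link between group sampling and advantage estimation concrete, here is a minimal sketch. The group-relative advantage (reward minus group mean, divided by group standard deviation) is the standard GRPO formulation. The `stratified_group` helper, its `guided_ratio` parameter, and the fixed guided/unguided split are assumptions for illustration and not UI-Mem's actual sampling mechanism.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def stratified_group(guided_ratio: float, group_size: int) -> List[bool]:
    """Flag each rollout in a group: True = memory-guided, False = unguided.

    Hypothetical helper; the ratio is an assumed knob, not a paper value.
    """
    n_guided = round(guided_ratio * group_size)
    return [True] * n_guided + [False] * (group_size - n_guided)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """GRPO-style advantages: normalize each reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

If every rollout in a group receives the same reward, all advantages are zero and the group contributes no learning signal. Mixing guided and unguided rollouts is one plausible way to keep reward variance within a group, which may be what the quoted "effective advantage estimation" refers to.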

UI-Venus-1.5 Technical Report

UI-Venus-1.5 Technical Report [64.5]
We present UI-Venus-1.5, a unified, end-to-end GUI Agent. The proposed model family consists of two dense variants (2B and 8B) and one mixture-of-experts variant (30B-A3B). In addition, UI-Venus-1.5 demonstrates robust navigation capabilities across a variety of Chinese mobile apps.
Reference translation of the paper (metadata) (Mon, 09 Feb 2026 18:43:40 GMT)
Version 1.5 of UI-Venus. "Unified Single-Agent via Model Merging: A major distinction from UI-Venus-1.0 is that UI-Venus-1.5 is a purely end-to-end model, which greatly simplifies deployment for users." Although it is called 1.5, it appears to differ considerably from 1.0.
Repository: GitHub – inclusionAI/UI-Venus: UI-Venus is a native UI agent designed to perform precise GUI element grounding and effective navigation using only screenshots as input.

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision

Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision [15.8]
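Sci-CoE is a two-stage scientific co-evolving framework that enables a model to self-evolve as both solver and verifier. In the first stage, the model uses a small set of annotated data to establish correctness-judgment anchors for the verifier. In the second stage, it introduces a geometric reward mechanism that jointly considers consensus, reliability, and diversity to drive large-scale self-evaluation.
Reference translation of the paper (metadata) (Thu, 12 Feb 2026 16:46:00 GMT)
"we introduce Sci-CoE, a scientific co-evolving framework that consists of a Solver and a Verifier, both implemented within a single LLM." A model of the type that evolves cooperatively. Performance improves over the base model.
Repository: GitHub – InternScience/Sci-CoE: Sci-CoE: Co-evolving Scientific Reasoning LLMs via Geometric Consensus with Sparse Supervision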

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery [138.0]
InternAgent-1.5 is a unified system designed for end-to-end scientific discovery. The system is built on a structured architecture composed of three coordinated subsystems for generation, verification, and evolution. InternAgent-1.5 is evaluated on scientific reasoning benchmarks such as GAIA, HLE, GPQA, and FrontierScience.
Reference translation of the paper (metadata) (Mon, 09 Feb 2026 18:36:06 GMT)
"A Unified Architecture for End-to-end Scientific Discovery: InternAgent-1.5 organizes the scientific discovery process into three coherent subsystems for Generation, Verification, and Evolution. These subsystems support the full cycle of hypothesis formulation, methodological evaluation, and evidence driven refinement through foundational capabilities for deep research, solution refinement, and long horizon memory." A proposal for an agentic framework aimed at scientific discovery.
Repository: GitHub – InternScience/InternAgent: InternAgent-1.5: A Unified Agentic Framework for Long-Horizon Autonomous Scientific Discovery

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining

QuantaAlpha: An Evolutionary Framework for LLM-Driven Alpha Mining [24.8]
QuantaAlpha is an evolutionary alpha mining framework that treats each end-to-end mining run as a trajectory. QuantaAlpha localizes suboptimal steps in each trajectory for targeted revision. During factor generation, QuantaAlpha enforces semantic consistency across the hypothesis, factor expression, and executable code.
Reference translation of the paper (metadata) (Fri, 06 Feb 2026 08:08:04 GMT)
According to the authors: "We present QuantaAlpha, a self-evolving framework for interpretable alpha mining that formulates factor discovery as a constrained multi-agent research process. Extensive experiments across both Chinese and U.S. equity markets show that QuantaAlpha consistently produces more stable and generalizable factors than all baselines." It is realized through the process "(A) Diversified Planning Initialization to generate candidate hypotheses, (B) Factor Realization that iteratively instantiates hypotheses into executable factors with constraint gating, (C) Self-Evolution that applies mutation and crossover over evaluated trajectories, and (D) A Final Factor Pool that consolidates validated effective factors." If the real-world performance matches the reported test results, this is very interesting.
Repository: GitHub – QuantaAlpha/QuantaAlpha: QuantaAlpha transforms how you discover quantitative alpha factors by combining LLM intelligence with evolutionary strategies. Just describe your research direction, and watch as factors are automatically mined, evolved, and validated through self-evolving trajectories.

LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction

LLMs as Cultural Archives: Cultural Commonsense Knowledge Graph Extraction [57.2]
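Large language models (LLMs) encode rich cultural knowledge learned from diverse web-scale data. We propose an iterative, prompt-based framework for constructing a Cultural Commonsense Knowledge Graph (CCKG). Even when the target culture is not English-speaking, the cultural knowledge graph is better realized in English.
Reference translation of the paper (metadata) (Sun, 25 Jan 2026 20:05:04 GMT)
A proposal and validation of a method for extracting cultural knowledge graphs from LLMs. "Human evaluations show that while native languages convey richer cultural depth, English outputs are generally more coherent and preferred. Empirically, augmenting LLMs with CCKG improves performance on cultural commonsense reasoning and story generation." This is convincing, and it also seems to suggest the importance of building Japanese LLMs.
Repository: GitHub – JuniorTonga/Cultural_Commonsense_Knowledge_Graph: [EACL 2026 Main] Framework to construct a Cultural Commonsense Knowledge Graph (CCKG) that have geographical context.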

OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks

OS-Marathon: Benchmarking Computer-Use Agents on Long-Horizon Repetitive Tasks [37.0]
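Long-horizon, repetitive tasks are common in professional settings. Because these tasks can extend to extreme lengths in proportion to the size of the data to be processed, they are often tedious for humans. We construct OS-Marathon, consisting of 242 long-horizon repetitive tasks across two domains, and evaluate state-of-the-art (SOTA) agents.
Reference translation of the paper (metadata) (Wed, 28 Jan 2026 14:35:23 GMT)
"OS-Marathon is specifically tailored to evaluate CUA performance in long-horizon, repetitive execution scenarios, comprising 242 tasks across 2 domains and 7 distinct execution environments." A GUI agent benchmark built around long-horizon, repetitive tasks. It looks like quite a difficult benchmark.
Project site: OS-Marathon Benchmark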
