Introducing the Latest arXiv Papers

Towards Trustworthy GUI Agents: A Survey

Towards Trustworthy GUI Agents: A Survey [64.6]
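This survey examines the trustworthiness of GUI agents along five critical dimensions. It identifies major challenges, such as vulnerability to adversarial attacks and cascading failure modes in sequential decision-making. As GUI agents become more widespread, establishing robust safety standards and responsible development practices is essential.
Reference translation of the paper (metadata) (Sun, 30 Mar 2025 13:26:00 GMT)
A survey on the trustworthiness of GUI agents. The organizing axes are "Security", "Reliability", "Explainability", "Ethical Alignment", and "Evaluation methodologies".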

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning

DocSAM: Unified Document Image Segmentation via Query Decomposition and Heterogeneous Mixed Learning [39.1]
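Document image segmentation is essential for document analysis and recognition. Existing methods handle these tasks separately, which limits generalization and wastes resources. This paper introduces DocSAM, a transformer-based unified framework designed for a variety of document image segmentation tasks.
Reference translation of the paper (metadata) (Sat, 05 Apr 2025 07:14:53 GMT)
Document image segmentation remains important even in the current heyday of MLLMs. The paper proposes a method in which "DocSAM integrates layout analysis, multi-grained text segmentation, and table structure decomposition into a single model, reducing the need for specialized models and enhancing efficiency."
The repository is GitHub – xhli-git/DocSAM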

Llama 4, Nemotron-H, Pangu Ultra, Kimi-VL, Kimi-VL-Thinking, Deep Coder

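There was again plenty of LLM-related news last week, and the Llama 4 announcement was one of the biggest items (The Llama 4 herd: The beginning of a new era of natively multimodal AI innovation). It uses an MoE architecture and claims high performance, but some third-party evaluations have found it underwhelming, and others suggest that quantization (and the resulting performance degradation) may be a major factor, so I would like to wait until the verification results are all in.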

NVIDIA has announced Nemotron-H, a Mamba-Transformer hybrid (Nemotron-H: A Family of Accurate, Efficient Hybrid Mamba-Transformer Models – NVIDIA ADLR). The statement that "Nemotron-H has been used as the backbone for Cosmos-Reason 1, a very strong VLM for physical AI." is also worth noting.

Huawei has released a paper on Pangu Ultra, but the detailed PDF does not appear to be publicly available. It contains the intriguing statement "To perform such large-scale training efficiently, we utilize 8,192 Ascend NPUs with a series of system optimizations. Evaluations on multiple diverse benchmarks indicate that Pangu Ultra significantly advances the state-of-the-art capabilities of dense LLMs such as Llama 405B and Mistral Large 2, and even achieves competitive results with DeepSeek-R1", and I am curious about the details.

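Kimi-VL is a powerful MLLM, and as a public model it is distinctive in that it is also available as an LRM, Kimi-VL-Thinking (moonshotai/Kimi-VL-A3B-Instruct · Hugging Face). Open models are evolving quickly as well, for example DeepCoder: A Fully Open-Source 14B Coder at O3-mini Level, which claims o3-mini-level performance. Efforts to strengthen open models, such as Introducing Cogito Preview (Cogito v1 Preview – a deepcogito Collection), are also producing a range of results, and the performance of public models continues to improve.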

Nemotron-H: A Family of Accurate and Efficient Hybrid Mamba-Transformer Models [164.5]
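Nemotron-H is a family of 8B and 56B/47B hybrid Mamba-Transformer models. The majority of the self-attention layers in the common Transformer model architecture are replaced with Mamba layers. Nemotron-H models offer either better or on-par accuracy compared to other similarly sized open-source Transformer models.
Reference translation of the paper (metadata) (Fri, 04 Apr 2025 17:41:58 GMT)
A fast, high-performance Mamba-hybrid LLM.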

Pangu Ultra: Pushing the Limits of Dense Large Language Models on Ascend NPUs [123.3]
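This paper presents Pangu Ultra, a large language model (LLM) with 135 billion parameters and dense Transformer modules. To perform such large-scale training efficiently, 8,192 Ascend NPUs are used together with a series of system optimizations. The study shows that Ascend NPUs can efficiently and effectively train dense models with more than 100 billion parameters.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 15:41:51 GMT)
Huawei's LLM. It is said to have been built using Huawei's accelerators, but the paper cannot currently be accessed. I am curious about the details.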

Kimi-VL Technical Report [88.1]
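Kimi-VL is a vision-language model (VLM) that offers advanced multimodal reasoning, long-context understanding, and strong agent capabilities. As a general-purpose VLM, Kimi-VL excels at multi-turn agent tasks (such as OSWorld), matching flagship models. Building on Kimi-VL, the paper introduces Kimi-VL-Thinking, an advanced long-thinking variant.
Reference translation of the paper (metadata) (Thu, 10 Apr 2025 06:48:26 GMT)
A multimodal LLM that also performs well on agent tasks. The Thinking version achieves high performance relative to its parameter count.
The repositories are GitHub – MoonshotAI/Kimi-VL: Kimi-VL: Mixture-of-Experts Vision-Language Model for Multimodal Reasoning, Long-Context Understanding, and Strong Agent Capabilities and moonshotai/Kimi-VL-A3B-Instruct · Hugging Face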

PaperBench: Evaluating AI’s Ability to Replicate AI Research

PaperBench: Evaluating AI’s Ability to Replicate AI Research [3.5]
PaperBench is a benchmark that evaluates the ability of AI agents to replicate state-of-the-art AI research. Agents must replicate 20 ICML 2024 Spotlight and Oral papers from scratch. PaperBench contains 8,316 individually gradable tasks.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 15:55:24 GMT)
OpenAI's proposal of "PaperBench, a benchmark evaluating the ability of AI agents to replicate state-of-the-art AI research."
The repository is GitHub – openai/preparedness: Releases from OpenAI Preparedness

Inducing Programmatic Skills for Agentic Tasks

Inducing Programmatic Skills for Agentic Tasks [54.0]
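This work proposes agent skill induction (ASI), which lets agents adapt themselves by inducing, verifying, and utilizing program-based skills on the fly. ASI outperforms the static baseline agent and its text-skill counterpart by 23.5% and 11.3% in success rate, respectively.
Reference translation of the paper (metadata) (Wed, 09 Apr 2025 12:25:37 GMT)
The paper proposes representing skills as program code and verifies the effectiveness of doing so: "We present ASI, namely agent skill induction (§2), that induces and applies programmatic skills along the process of solving user web navigation queries. More concretely, given a natural language (NL) query, the agent first generates an action trajectory attempting to solve the task using built-in, primitive actions such as click and scroll."
Natural language and formal languages differ considerably in expressiveness and in how they abstract, ambiguity included, so I can't help thinking that choosing between them appropriately is what matters.
The repository is GitHub – zorazrw/agent-skill-induction: Agent Skill Induction: “Inducing Programmatic Skills for Agentic Tasks”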
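As a rough illustration of the "induce, verify, utilize" loop described above, here is a minimal Python sketch. It is not the authors' implementation: `ToyPage`, `induce_skill`, and `verify_and_store` are hypothetical names, the "page" only records which primitive actions were called, and a skill is simply a replay of a recorded trajectory rather than a program written by an LLM.

```python
from typing import Callable, Dict, List, Tuple

# One step of a trajectory: (primitive action name, positional arguments).
Action = Tuple[str, tuple]


class ToyPage:
    """Hypothetical stand-in for a web page; it only logs primitive actions."""

    def __init__(self) -> None:
        self.log: List[str] = []

    def click(self, element_id: str) -> None:
        self.log.append(f"click:{element_id}")

    def scroll(self, pixels: int) -> None:
        self.log.append(f"scroll:{pixels}")


Skill = Callable[[ToyPage], None]


def induce_skill(trajectory: List[Action]) -> Skill:
    """Wrap a recorded trajectory of primitive actions as a reusable function."""

    def skill(page: ToyPage) -> None:
        for name, args in trajectory:
            getattr(page, name)(*args)

    return skill


def verify_and_store(
    name: str,
    skill: Skill,
    check: Callable[[ToyPage], bool],
    library: Dict[str, Skill],
) -> bool:
    """Re-run the skill on a fresh page and keep it only if the check passes."""
    page = ToyPage()
    skill(page)
    if check(page):
        library[name] = skill
        return True
    return False


if __name__ == "__main__":
    library: Dict[str, Skill] = {}
    trajectory = [("scroll", (300,)), ("click", ("search-button",))]
    stored = verify_and_store(
        "open_search",
        induce_skill(trajectory),
        lambda page: page.log[-1] == "click:search-button",
        library,
    )
    print(stored, sorted(library))
```

Because the stored skill is an ordinary function, a later task can call it as a single higher-level action, and the verification step keeps trajectories that do not pass the check out of the library.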

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond

A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.6]
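Large reasoning models (LRMs) have shown strong performance gains by scaling up the length of chain-of-thought (CoT) reasoning during inference. A growing concern is their tendency to produce excessively long reasoning traces. This inefficiency poses significant challenges for training, inference, and real-world deployment.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 15:36:30 GMT)
A survey described as follows: "In this survey, we provide a comprehensive overview of recent efforts aimed at improving reasoning efficiency in LRMs, with a particular focus on the unique challenges that arise in this new paradigm." As I also felt with Fugu-MT paper translation (abstract): Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models, the progression from new method to new problem to comprehensive survey is extremely fast.
The repository is GitHub – XiaoYee/Awesome_Efficient_LRM_Reasoning: A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond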

Measurement of LLM’s Philosophies of Human Nature

Measurement of LLM’s Philosophies of Human Nature [113.5]
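The authors design a standardized psychological scale aimed at large language models (LLMs). Current LLMs exhibit a lack of trust in humans. The paper proposes a mental loop learning framework that lets an LLM continuously optimize its value system.
Reference translation of the paper (metadata) (Thu, 03 Apr 2025 06:22:19 GMT)
A proposal of the "Machine-based Philosophies of Human Nature Scale (M-PHNS)", a tool for assessing how LLMs view human nature. The finding is interesting: "Most models exhibit varying degrees of negative tendencies, such as perceiving humans as untrustworthy, selfish, and volatile. These tendencies intensify as the intelligence level of the model increases. This phenomenon is consistent regardless of the model’s developer or whether the model is open-source." The paper also proposes a framework for correcting these tendencies, though it is somewhat unclear whether that is a good thing.
The repository is kodenii/M-PHNS · GitHub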

Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection

Review, Refine, Repeat: Understanding Iterative Decoding of AI Agents with Dynamic Evaluation and Selection [71.9]
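Inference-time methods such as Best-of-N (BON) sampling offer a simple yet effective alternative for improving performance. This paper proposes Iterative Agent Decoding (IAD), which combines iterative refinement with dynamic candidate evaluation and verifier-guided selection.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 17:40:47 GMT)
A proposal for improving agent performance under the (common) assumption of API-only access: "In this work, we proposed IAD : an iterative decoding approach for AI agent alignment with black box access which highlights the effectiveness of iterative decoding (guided by a verifier) for these complex agentic tasks."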
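To make the contrast between Best-of-N sampling and verifier-guided iterative decoding concrete, here is a minimal Python sketch. It is a toy under stated assumptions, not the paper's algorithm: the "model" is a hypothetical random integer generator (`make_toy_generator`), the "verifier" (`toy_verifier`) just scores closeness to an arbitrary target of 42, and all function names are illustrative.

```python
import random
from typing import Callable, Optional, Tuple

Generator = Callable[[Optional[int]], int]
Scorer = Callable[[int], float]


def make_toy_generator(seed: int) -> Generator:
    """Hypothetical black-box model: proposes integers, optionally refining one."""
    rng = random.Random(seed)

    def generate(previous: Optional[int]) -> int:
        if previous is None:
            return rng.randint(0, 100)         # fresh sample
        return previous + rng.randint(-5, 5)   # small refinement of the given candidate

    return generate


def toy_verifier(candidate: int) -> float:
    """Hypothetical verifier: higher is better, best at the arbitrary target 42."""
    return -abs(candidate - 42)


def best_of_n(generate: Generator, score: Scorer, n: int) -> int:
    """Sample n independent candidates and keep the one the verifier prefers."""
    return max((generate(None) for _ in range(n)), key=score)


def iterative_decode(
    generate: Generator, score: Scorer, n: int, rounds: int
) -> Tuple[int, float]:
    """Each round refines the current best into n candidates; keep improvements only."""
    best: Optional[int] = None
    best_score = float("-inf")
    for _ in range(rounds):
        candidates = [generate(best) for _ in range(n)]
        top = max(candidates, key=score)
        top_score = score(top)
        if top_score > best_score:
            best, best_score = top, top_score
    return best, best_score


if __name__ == "__main__":
    print(best_of_n(make_toy_generator(0), toy_verifier, n=4))
    print(iterative_decode(make_toy_generator(0), toy_verifier, n=4, rounds=5))
```

With `rounds=1` the iterative loop reduces to Best-of-N. Later rounds can only keep or improve the verifier score, because each round conditions on the best candidate so far and a worse candidate is never accepted.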

Large Language Model Agent: A Survey on Methodology, Applications and Challenges

Large Language Model Agent: A Survey on Methodology, Applications and Challenges [88.3]
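Large language model (LLM) agents, with goal-driven behaviors and dynamic adaptation capabilities, may represent a critical pathway toward artificial general intelligence. This survey systematically deconstructs LLM agent systems through a methodology-centered taxonomy. The work offers a unified architectural perspective on how agents are built, how they collaborate, and how they evolve over time.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 12:50:17 GMT)
A survey of agents, a field expanding rapidly thanks to LLMs. "Despite remarkable progress, significant challenges remain, including scalability limitations, memory constraints, reliability concerns, and inadequate evaluation frameworks."
The repository is GitHub – luo-junyu/Awesome-Agent-Papers: Large Language Model Agent: A Survey on Methodology, Applications and Challenges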

REALM: A Dataset of Real-World LLM Use Cases

REALM: A Dataset of Real-World LLM Use Cases [69.6]
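REALM is a dataset of 94,000 LLM use cases collected from Reddit and news articles. REALM captures two key dimensions: the diverse applications of LLMs and the demographics of their users. It categorizes LLM applications and investigates how users' occupations relate to the types of applications they use.
Reference translation of the paper (metadata) (Mon, 24 Mar 2025 15:39:25 GMT)
A dataset with an unusual perspective: "REALM (Real-World Application of Large Language Model Dataset) Dataset".
The project site is REALM Dataset Dashboard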
