Introducing the Latest arXiv Papers

Model Hemorrhage and the Robustness Limits of Large Language Models

Model Hemorrhage and the Robustness Limits of Large Language Models [119.5]
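Large language models (LLMs) show strong performance across natural language processing tasks, but suffer substantial performance degradation when modified for deployment. We define this phenomenon as model hemorrhage: performance decline caused by parameter alterations and architectural changes.
Reference translation of the paper (metadata) (Mon, 31 Mar 2025 10:16:03 GMT)
"Model Hemorrhage refers to the phenomenon where large language models (LLMs) and their extended frameworks (e.g., multimodal models) experience performance degradation, robustness weakening, or adaptability failure during training, optimization, deployment, or task adaptation". A study of performance degradation of the kind that typically arises from quantization at model deployment time.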
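To make the quantization case concrete, here is a minimal sketch that rounds a list of weights onto a coarse grid and measures how far they move. It assumes symmetric uniform quantization with one scale per tensor; the function names and the synthetic Gaussian weights are illustrative only and are not taken from the paper.

```python
import random


def quantize_dequantize(weights, bits):
    """Round weights onto a symmetric uniform grid, then map back to floats.

    Assumption: one scale per tensor, chosen from the largest magnitude.
    """
    levels = 2 ** (bits - 1) - 1  # e.g. 127 for 8 bits, 7 for 4 bits
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0:
        return list(weights)
    scale = max_abs / levels
    return [round(w / scale) * scale for w in weights]


def mean_squared_error(a, b):
    """Average squared difference between two equal-length sequences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def quantization_errors(weights, bit_widths=(8, 4, 2)):
    """Map each bit width to the reconstruction error it introduces."""
    return {
        bits: mean_squared_error(weights, quantize_dequantize(weights, bits))
        for bits in bit_widths
    }


# Synthetic stand-in for a weight tensor (hypothetical data).
rng = random.Random(0)
weights = [rng.gauss(0.0, 1.0) for _ in range(1000)]
errors = quantization_errors(weights)

for bits, err in errors.items():
    print(f"{bits}-bit reconstruction MSE: {err:.6f}")
```

The reconstruction error grows as the bit width shrinks. This is only a proxy: the degradation discussed in the paper is measured on downstream task performance, not on weight error.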

Command A: An Enterprise-Ready Large Language Model

Command A: An Enterprise-Ready Large Language Model [165.9]
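Command A is an agent-optimized, multilingual-capable model. It offers best-in-class Retrieval Augmented Generation capabilities.
Reference translation of the paper (metadata) (Tue, 01 Apr 2025 12:08:07 GMT)
This is the paper for Command A, previously covered in "Gemma3, Command A, OLMo 2 32B, ERNIE 4.5 & X1 – Introducing the Latest arXiv Papers". The model is strong at multilingual processing, though its performance varies by language.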

Agent S2, Devin 2, Amazon Nova Act, An Illusion of Progress? Assessing the Current State of Web Agents

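Version 2 of Agent S, which was covered previously, has been released. In half a year its OSWorld score rose from 20.5 to 27.0 (15 steps). Improvements in the base model (LLM) probably contribute, but it still looks like steady progress. With announcements such as Introducing Amazon Nova Act | Amazon AGI Labs and Cognition | Devin 2.0 arriving one after another, GUI-agent-style LLM-based agents are clearly in vogue.

Among personal sites, fugumt.com has also added agent functionality with Fugu-MT:Agent (Embedding an agent into a site using OpenManus | Proof of Concept). Because this makes it easy to extend a site's functionality, I expect more sites like this to appear (*1).

Amid all this, "An Illusion of Progress? Assessing the Current State of Web Agents" points out that "Surprisingly, many recent agents, except for Operator, do not outperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024." As that paper also notes, sound evaluation datasets and frameworks are needed.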

Agent S2: A Compositional Generalist-Specialist Framework for Computer Use Agents [30.3]
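Computer use agents automate digital tasks by interacting directly with the graphical user interfaces (GUIs) of computers and mobile devices. This paper introduces Agent S2, a novel compositional framework that delegates cognitive responsibilities across various generalist and specialist models. Agent S2 establishes state-of-the-art (SOTA) performance on three prominent computer use benchmarks.
Reference translation of the paper (metadata) (Tue, 01 Apr 2025 15:40:27 GMT)
Version 2 of the agent covered in "Agent S: An Open Agentic Framework that Uses Computers Like a Human – Introducing the Latest arXiv Papers". Performance improves across the board, and the authors claim SOTA on several benchmarks.
The repository is GitHub – simular-ai/Agent-S: Agent S: an open agentic framework that uses computers like a human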

An Illusion of Progress? Assessing the Current State of Web Agents [49.8]
We conduct a comprehensive and rigorous assessment of the current state of web agents. The results paint a very different picture of current agents' capabilities and suggest that previously reported results were overly optimistic. We introduce Online-Mind2Web, an online evaluation benchmark.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 05:51:29 GMT)
A benchmark for web agents. The paper reports: "Many recent agents, except for Operator (OpenAI, 2025), underperform the simple SeeAct agent (Zheng et al., 2024) released in early 2024. Even Operator only achieves a success rate of 61%, showing substantial room for improvement."
The repository is GitHub – OSU-NLP-Group/Online-Mind2Web

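(*1) The behavior is fun to watch, so I have forced it to work using OpenManus. For now its practical usefulness is questionable, but a version upgrade is planned soon.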

An Approach to Technical AGI Safety and Security

An Approach to Technical AGI Safety and Security [72.8]
We develop an approach to addressing the risk of harms severe enough to significantly harm humanity. We focus on technical approaches to misuse and misalignment. We outline how these ingredients can be combined to make AGI systems safe.
Reference translation of the paper (metadata) (Wed, 02 Apr 2025 15:59:31 GMT)
A paper on AGI safety from Google DeepMind. The content is very interesting, and it includes thought-provoking statements throughout, such as "Timelines: We are highly uncertain about the timelines until powerful AI systems are developed, but crucially, we find it plausible that they will be developed by 2030." and "Importantly, AI progress does not usually involve large discontinuous jumps in capability assuming continuous increases in inputs (Section 3.5), though the overall pace of progress may accelerate (Section 3.4)."

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving

Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving [26.0]
We introduce Multi-SWE-bench, a multilingual issue-resolving benchmark covering Java, TypeScript, JavaScript, Go, Rust, C, and C++. It contains a total of 1,632 high-quality instances, carefully annotated from 2,456 candidates by 68 expert annotators. We evaluate a series of state-of-the-art models on Multi-SWE-bench using three representative methods. We also launch Multi-SWE-RL, an open-source community aimed at building large-scale reinforcement learning (RL) training datasets.
Reference translation of the paper (metadata) (Thu, 03 Apr 2025 14:06:17 GMT)
"we introduce a multilingual issue-resolving benchmark, called Multi-SWE-bench, covering Java, TypeScript, JavaScript, Go, Rust, C, and C++." A benchmark that is multilingual in the programming-language sense. MopenHands, essentially a modified version of OpenHands, appears to be the strongest method, but it is interesting that results differ across languages.
- GitHub – All-Hands-AI/OpenHands: 🙌 OpenHands: Code Less, Make More. It is also interesting that OpenHands is building an LLM tuned for code generation, as announced in Introducing OpenHands LM 32B — A Strong, Open Coding Agent Model.
The repository is GitHub – multi-swe-bench/multi-swe-bench: Multi-SWE-bench: A Multilingual Benchmark for Issue Resolving, and the leaderboard is Multi-SWE-bench.
The mention of AGI in "Multi-SWE-RL is an open-source community aimed at developing high-quality RL training datasets for complex software engineering tasks. Its purpose is to serve as the foundational infrastructure for training fully autonomous agents capable of addressing real-world software engineering challenges, paving the way toward achieving AGI." and the statement "In light of these advancements, we are firmly convinced that “scaling RL in real-world environments is the path toward human-like intelligence”." are both exciting.

MARS: Memory-Enhanced Agents with Reflective Self-improvement

MARS: Memory-Enhanced Agents with Reflective Self-improvement [19.0]
This paper proposes a memory-enhanced agent with reflective self-improvement. The framework consists of three agents: User, Assistant, and Checker.
Reference translation of the paper (metadata) (Tue, 25 Mar 2025 02:05:46 GMT)
「we propose the MARS framework, which enhances agents’ self-adjustment and memory management in complex tasks through reflective mechanisms and memory optimization.」
"The MARS framework implements a dual-memory system, consisting of Short-Term Memory (STM) and Long-Term Memory (LTM)". A proposal for a memory-enhancement framework with an agentic approach, distinguished by its separation of short-term and long-term memory.
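The split between short-term and long-term memory can be illustrated with a toy sketch. This is not the MARS implementation: the `DualMemory` class, the fixed STM capacity, and the importance-threshold promotion rule are all hypothetical choices made only to show the general shape of a dual-memory store.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DualMemory:
    """Toy dual-memory store: a bounded STM plus an unbounded LTM.

    Hypothetical rule: when the STM overflows, the oldest entry is
    promoted to the LTM only if its importance meets the threshold;
    otherwise it is forgotten.
    """

    stm_capacity: int = 3
    importance_threshold: float = 0.5
    stm: deque = field(default_factory=deque)
    ltm: list = field(default_factory=list)

    def remember(self, item: str, importance: float) -> None:
        """Add an item to the STM, evicting the oldest entries if needed."""
        self.stm.append((item, importance))
        while len(self.stm) > self.stm_capacity:
            old_item, old_importance = self.stm.popleft()
            if old_importance >= self.importance_threshold:
                self.ltm.append(old_item)

    def recall(self) -> list:
        """Return long-term items first, then the recent short-term items."""
        return list(self.ltm) + [item for item, _ in self.stm]


memory = DualMemory(stm_capacity=2)
memory.remember("user asked for a summary", 0.9)
memory.remember("small talk", 0.1)
memory.remember("checker flagged an error", 0.8)
memory.remember("assistant revised the answer", 0.7)
print(memory.recall())
```

In this run the first item is promoted to long-term memory when it leaves the short-term window, while the low-importance "small talk" entry is dropped.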

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers?

CLAIMCHECK: How Grounded are LLM Critiques of Scientific Papers? [36.8]
CLAIMCHECK is an annotated dataset of NeurIPS 2023 and 2024 submissions and reviews mined from OpenReview. ML experts have richly annotated it with the weakness statements in the reviews and the paper claims those weaknesses dispute, along with fine-grained labels for the validity, objectivity, and type of each identified weakness. We benchmark LLMs on three claim-centric tasks supported by CLAIMCHECK: (1) associating weaknesses with the claims they dispute, (2) predicting fine-grained labels for weaknesses and rewriting the weaknesses to enhance their specificity, and (3) verifying a paper's claims with grounded reasoning.
Reference translation of the paper (metadata) (Thu, 27 Mar 2025 17:29:45 GMT)
A benchmark proposal: "This work has introduced CLAIMCHECK—a benchmark of reviewer-identified weaknesses in NeurIPS 2023 and 2024 submissions, richly annotated with descriptive labels by experts and grounded in the claims that they dispute in the reviewed papers. Further, we benchmark various LLMs on three novel tasks enabled by CLAIMCHECK—Weakness Labeling and Editing (WLE), Claim Association (CA), and Claim Verification (CV)—all aimed at assisting reviewers during the peer review process." These tasks are difficult for current LLMs.
The repository is reportedly at https://github.com/JHU-CLSP/CLAIMCHECK

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities

Trade-offs in Large Reasoning Models: An Empirical Analysis of Deliberative and Adaptive Reasoning over Foundational Capabilities [101.8]
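Recent advances in Large Reasoning Models (LRMs) have shown remarkable performance on specialized reasoning tasks. We show that acquiring deliberative reasoning capabilities significantly degrades the foundational capabilities of LRMs. We also show that adaptive reasoning (Zero-Thinking, Less-Thinking, Summary-Thinking) can effectively mitigate these drawbacks.
Reference translation of the paper (metadata) (Sun, 23 Mar 2025 08:18:51 GMT)
The results in Table 5, "The overall results of different LRMs under the Zero-Thinking, Summary-Thinking and Summary-Thinking-Plus mode for the evaluation of foundational capabilities.", are very interesting. They make clear that simply spending more compute on reasoning is not enough and that adaptive strategies matter.
The repository is GitHub – SCIR-SC-Qiaoban-Team/FreeEvalLM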

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction

PP-DocLayout: A Unified Document Layout Detection Model to Accelerate Large-Scale Data Construction [4.2]
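This paper proposes PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across different document formats. Beyond advancing the state of the art in document layout analysis, this work also provides a robust solution for constructing high-quality training data.
Reference translation of the paper (metadata) (Fri, 21 Mar 2025 15:20:47 GMT)
"we present PP-DocLayout, which achieves high precision and efficiency in recognizing 23 types of layout regions across diverse document formats." A proposal for a layout recognition model that can handle diverse data.
The repository is PaddleX/README_en.md at release/3.0-rc · PaddlePaddle/PaddleX · GitHub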

AdaWorld: Learning Adaptable World Models with Latent Actions

AdaWorld: Learning Adaptable World Models with Latent Actions [76.5]
We propose AdaWorld, an innovative world model learning approach that enables efficient adaptation. The key idea is to incorporate action information during the pretraining of world models. We then develop an autoregressive world model conditioned on these latent actions.
Reference translation of the paper (metadata) (Mon, 24 Mar 2025 17:58:15 GMT)
A proposal for AdaWorld: "We present AdaWorld, an autoregressive world model that is highly adaptable across various environments. It can readily transfer actions to different contexts and allows efficient adaptation with limited interactions." Its structure is described as "AdaWorld consists of two key components: a latent action autoencoder that extracts actions from unlabeled videos, and an autoregressive world model that takes the extracted actions as conditions."
The repository is AdaWorld
