September 2025 – Page 6 – arXiv最新論文の紹介

Qwen3-Max, K2-Instruct-0905, LongCat-Flash, Dream-Coder 7B, Kwai Keye-VL 1.5

Last week again brought plenty of news in the LLM/LRM space. Qwen3-Max, the largest configuration in the Qwen3 family, was released (Qwen on X: "Big news: Introducing Qwen3-Max-Preview (Instruct) — our biggest model yet, with over 1 trillion parameters! 🚀 Now available via Qwen Chat & Alibaba Cloud API. Benchmarks show it beats our previous best, Qwen3-235B-A22B-2507. Internal tests + early user feedback confirm: https://t.co/7vQTfHup1Z"; Models and pricing – Alibaba Cloud Model Studio – Alibaba Cloud Documentation Center), and Kimi K2 was updated (Kimi.ai on X: "Kimi K2-0905 update 🚀 – Enhanced coding capabilities, esp. front-end & tool-calling – Context length extended to 256k tokens – Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc) 🔗 Weights & code: https://t.co/83sQekosr9 💬 Chat with new Kimi https://t.co/mkOuBMwzpw"; moonshotai/Kimi-K2-Instruct-0905 · Hugging Face). Beyond those and LongCat-Flash, smaller but distinctive models such as Dream-Coder 7B and Kwai Keye-VL 1.5 were also announced.

Adjacent areas also deserve close attention, including protocol proposals such as the Agent Client Protocol (Introduction – Agent Client Protocol; GitHub – zed-industries/agent-client-protocol: A protocol for connecting any editor to any agent).

LongCat-Flash Technical Report [165.7]
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. Training on more than 20 trillion tokens was completed within 30 days, and inference exceeds 100 tokens per second (TPS) at a cost of $0.70 per million output tokens.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 10:05:45 GMT)
A 560B MoE configuration. "As a non-thinking model, LongCat-Flash achieves performance comparable to state-of-the-art non-thinking models, including DeepSeek-V3.1 [DeepSeek-AI et al., 2025] and Kimi-K2 [Team et al., 2025], while using fewer parameters and offering faster inference speed. Specifically, LongCat-Flash scores 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ²-Bench, demonstrating robust capabilities in general domains, coding, and agentic tool use."
The repository is GitHub – meituan-longcat/LongCat-Flash-Chat
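
To put the quoted training and inference figures in perspective, here is a rough back-of-the-envelope calculation. It only rearranges numbers stated above (20 trillion tokens, 30 days, 100 TPS). The function names and the 2,000-token response length are illustrative assumptions, and the results are bounds rather than measured values.

```python
# Back-of-the-envelope arithmetic from the figures quoted in the entry above.
# "More than 20T tokens" within "30 days" gives a lower bound on throughput.
TRAINING_TOKENS = 20e12   # quoted: more than 20 trillion tokens
TRAINING_DAYS = 30        # quoted: within 30 days
DECODE_TPS = 100          # quoted: over 100 tokens per second

SECONDS_PER_DAY = 24 * 60 * 60


def training_throughput(tokens: float, days: float) -> float:
    """Average tokens processed per second over the whole training run."""
    return tokens / (days * SECONDS_PER_DAY)


def decode_seconds(n_tokens: int, tps: float) -> float:
    """Time to generate n_tokens at a steady decoding speed."""
    return n_tokens / tps


throughput = training_throughput(TRAINING_TOKENS, TRAINING_DAYS)
print(f"training throughput >= {throughput:,.0f} tokens/s")

# Hypothetical 2,000-token response (an assumption for illustration).
print(f"2,000-token reply takes <= {decode_seconds(2000, DECODE_TPS):.0f} s")
```

In other words, the training run must have averaged several million tokens per second across the cluster.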

Dream-Coder 7B: An Open Diffusion Language Model for Code [99.1]
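We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left to right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 05:30:56 GMT)
A diffusion model strengthened for coding tasks.
The repository is GitHub – DreamLM/Dream-Coder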

Kwai Keye-VL 1.5 Technical Report [91.3]
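This paper introduces Keye-VL-1.5, which addresses fundamental challenges in video understanding through three key innovations. First, it proposes a Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, it proposes a four-stage pre-training method that systematically extends the model's context length from 8K to 128K. Third, it develops a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 15:46:58 GMT)
A model that can handle video: "Keye-VL-1.5-8B establishes new state-of-the-art performance among models of similar scale, demonstrating superior results on video-centric benchmarks while maintaining competitive performance on general multimodal and reasoning tasks."
The repository is GitHub – Kwai-Keye/Keye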
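
The idea of allocating compute by inter-frame similarity can be sketched as follows. This is a minimal toy, not the Keye-VL-1.5 implementation. The cosine-similarity measure, the 0.95 threshold, the comparison against the most recent "slow" frame, and the per-frame token budgets are all assumptions chosen for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def route_frames(frames, threshold=0.95):
    """Label each frame 'slow' (full budget) or 'fast' (reduced budget).

    A frame is 'fast' when it is nearly identical to the most recent
    'slow' frame. The threshold is an assumed value.
    """
    labels = []
    last_slow = None
    for frame in frames:
        if last_slow is not None and cosine_similarity(frame, last_slow) >= threshold:
            labels.append("fast")
        else:
            labels.append("slow")
            last_slow = frame
    return labels


def token_budget(labels, slow_tokens=256, fast_tokens=64):
    """Total visual tokens spent, using hypothetical per-frame budgets."""
    return sum(slow_tokens if label == "slow" else fast_tokens for label in labels)


# Toy 2-D "frame features": two near-duplicates of each of two scenes.
frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
labels = route_frames(frames)
print(labels, token_budget(labels))
```

Near-duplicate frames get the cheaper budget, so a mostly static video costs far fewer tokens than one with frequent scene changes.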

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [151.0]
Developing autonomous agents for graphical user interfaces is a major challenge in artificial intelligence. This paper proposes UI-TARS-2, a GUI-centered agent model. In empirical evaluation, UI-TARS-2 improves significantly over the earlier UI-TARS-1.5.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 17:44:45 GMT)
An update to the work covered in UI-TARS: Pioneering Automated GUI Interaction with Native Agents – arXiv最新論文の紹介 and UFO2: The Desktop AgentOS, UI-TARS-1.5 – arXiv最新論文の紹介. The paper claims a large improvement over the previous model: "Empirical evaluation shows that UI-TARS-2 delivers significant improvements over UI-TARS-1.5 [56], achieving strong results in both GUI-based interaction and game environments. On GUI benchmarks, the model reaches 88.2 on Online-Mind2Web [77], 47.5 on OSWorld [75], 50.6 on WindowsAgentArena [10], and 73.3 on AndroidWorld [52], representing clear gains over the previous generation and outperforming strong baselines such as Claude and OpenAI agents in multiple cases." The points below are listed as the improvements, and the project gives a strong impression of having tried everything it could since the first version.
- First, to mitigate data scarcity, we design a scalable Data Flywheel that co-evolves the model and its training corpus through continual pretraining, supervised fine-tuning, rejection sampling, and multiturn RL.
- Second, to overcome the difficulties of scalable multi-turn RL, we design a training framework that stabilizes optimization in long-horizon settings.
- Third, to move beyond the limitations of pure GUI interaction, we construct a hybrid GUI-centered environment that augments on-screen actions with access to complementary resources such as file systems, terminals, and other external tools, enabling agents to solve a broader spectrum of realistic workflows.
- Fourth, to support large-scale training and evaluation, we build a unified sandbox platform capable of orchestrating heterogeneous environments—ranging from cloud VMs for GUI interaction to browser-based sandboxes for games—under a consistent API.
The repository is GitHub – bytedance/UI-TARS

On the Theoretical Limitations of Embedding-Based Retrieval

On the Theoretical Limitations of Embedding-Based Retrieval [15.8]
The paper shows that the number of top-k subsets of documents that can be returned as the result of some query is limited by the embedding dimension. It then creates a realistic dataset called LIMIT to stress-test models against these theoretical results. The work exposes the limits of embedding models under the existing single-vector paradigm.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 17:43:53 GMT)
A paper showing the limits of embedding-based information retrieval. "the critical-n values (for embedding size): 500k (512), 1.7m (768), 4m (1024), 107m (3072), 250m (4096). We note that this is the best case: a real embedding model cannot directly optimize the query and document vectors to match the test qrel matrix (and is constrained by factors such as “modeling natural language”). However, these numbers already show that for web-scale search, even the largest embedding dimensions with ideal test-set optimization are not enough to model all combinations." (The critical-n value where the dimensionality is too small to successfully represent all the top-2 combinations.) The constraint is surprisingly tight.
The repository is GitHub – google-deepmind/limit: On the Theoretical Limitations of Embedding-Based Retrieval
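
The dimension limit can be seen in a tiny toy case. The sketch below is not the paper's experiment. It uses three hand-picked documents and dot-product scoring to show that 1-D embeddings cannot return every top-2 pair while 2-D embeddings can. The document and query vectors are assumptions chosen for illustration. The final loop simply counts the top-2 combinations at the critical-n values quoted above.

```python
import math


def reachable_top_k(doc_vecs, query_vecs, k):
    """Set of top-k document index sets returned by at least one query."""
    reached = set()
    for q in query_vecs:
        scores = [sum(qi * di for qi, di in zip(q, d)) for d in doc_vecs]
        ranked = sorted(range(len(doc_vecs)), key=lambda i: scores[i], reverse=True)
        reached.add(frozenset(ranked[:k]))
    return reached


def unit(angle_deg):
    """Unit vector in 2-D at the given angle."""
    rad = math.radians(angle_deg)
    return [math.cos(rad), math.sin(rad)]


K = 2

# 1-D: a query can only sort documents ascending or descending,
# so the pair {0, 2} (the two extremes) is never the top-2.
docs_1d = [[-1.0], [0.2], [1.0]]
queries_1d = [[-1.0], [-0.5], [0.5], [1.0]]
reached_1d = reachable_top_k(docs_1d, queries_1d, K)

# 2-D: documents spread on a circle; every pair is reachable.
docs_2d = [unit(0), unit(120), unit(240)]
queries_2d = [unit(a) for a in range(0, 360, 30)]
reached_2d = reachable_top_k(docs_2d, queries_2d, K)

print(len(reached_1d), "of", math.comb(3, K), "pairs reachable in 1-D")
print(len(reached_2d), "of", math.comb(3, K), "pairs reachable in 2-D")

# Number of top-2 combinations at the quoted critical-n values.
critical_n = {512: 500_000, 768: 1_700_000, 1024: 4_000_000,
              3072: 107_000_000, 4096: 250_000_000}
for dim, n in critical_n.items():
    print(f"dim={dim}: n={n:,} -> {math.comb(n, K):,} top-2 combinations")
```

The number of possible pairs grows quadratically with corpus size, which is why even large embedding dimensions run out of room at web scale.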

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.7]
The paper shows that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average improvement of +13.8% on five representative reasoning tasks. The results indicate that a unified model excelling at both evaluation and generation is attainable.
Reference translation of the paper (metadata) (Sun, 31 Aug 2025 03:08:02 GMT)
The report states: "experimental results across massive visual benchmarks demonstrate that critic training not only substantially enhances the critic capabilities of VLMs, but also improves their performance as a general policy across a wide range of visual understanding and reasoning tasks. This dual improvement enables LLaVA-Critic-R1 to outperform other visual reasoning models trained with in-domain policy training, establishing it" The two capabilities are presumably strongly related, but the behavior is still interesting.
The repository is LLaVA-NeXT/llava-critic-r1 at main · LLaVA-VL/LLaVA-NeXT · GitHub

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? [71.2]
Agentic systems built on large language models (LLMs) achieve strong performance by combining multiple models and tools, but that complexity also makes them more fragile and more prone to failure. To address this, the paper proposes AgenTracer, which automatically annotates failed multi-agent trajectories, and develops AgenTracer-8B, a new lightweight tracer capable of error diagnosis. The tracer outperforms existing large language models and provides actionable feedback that helps agents self-correct and evolve.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 13:42:14 GMT)
Research on analyzing trajectory logs, a major problem when developing LLM-based agents: "By introducing AgenTracer, we provide the first automated framework capable of systematically generating annotated failure trajectories, as well as AgenTracer-8B, a lightweight yet effective failure tracer that leverages multi-granular RL to achieve prevailing diagnostic accuracy." AgenTracer-8B is a post-trained Qwen3-8B and is reported to perform very well for its size.
The project site is Academic Project Page, and the repository is GitHub – bingreeky/AgenTracer: AgenTracer: A Lightweight Failure Attributor for Agentic Systems

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.8]
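Large language models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool calling, and multi-turn interaction. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the system shell layer, the prompt orchestration layer, and the LLM inference core.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 13:00:28 GMT)
A survey of testing for LLM-based software.
As the conclusion puts it, "A key insight is that LLM application testing is neither a mere extension of traditional software testing nor a straightforward application of AI-security techniques." Because LLM-powered software is unavoidably dynamic and probabilistic in its behavior, testing methods appear to change considerably.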
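
One way probabilistic behavior changes testing is that a single exact-match assertion gives way to a pass-rate threshold over repeated trials. The sketch below illustrates that general idea only and is not the protocol proposed in the paper. `flaky_llm_stub`, its 90% success rate, and the 0.8 threshold are hypothetical stand-ins.

```python
import random


def flaky_llm_stub(prompt: str, rng: random.Random) -> str:
    """Hypothetical stand-in for an LLM call: correct about 90% of the time."""
    return "4" if rng.random() < 0.9 else "5"


def pass_rate(fn, prompt, check, trials, seed=0):
    """Run a non-deterministic function repeatedly and return the pass fraction.

    A fixed seed keeps the test itself reproducible even though the
    component under test is stochastic.
    """
    rng = random.Random(seed)
    passed = sum(1 for _ in range(trials) if check(fn(prompt, rng)))
    return passed / trials


rate = pass_rate(flaky_llm_stub, "What is 2 + 2?", lambda out: out == "4", trials=200)
print(f"pass rate: {rate:.2f}")

# Statistical acceptance criterion instead of a single exact-match assertion.
# The 0.8 threshold is an assumed tolerance, set below the expected rate.
assert rate >= 0.8
```

The threshold and trial count trade off test flakiness against sensitivity, so in practice they would be chosen per feature rather than fixed globally.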

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution [13.6]
Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. This paper introduces the Self-Execution Benchmark, which measures a model's ability to predict properties of its own output. The experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.
Reference translation of the paper (metadata) (Sun, 17 Aug 2025 07:57:58 GMT)
An unusual benchmark: "Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model’s ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance." It takes a meta-level perspective and is very interesting, results included.
The repository is GitHub – anon-researcher-2025/Self-Execution-Benchmark
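
The evaluation idea of comparing what a model predicts about its own output with what it actually does can be sketched as below. This is a toy harness, not the benchmark's code. `stub_model`, the `PREDICT:` prompt convention, and the refusal heuristic are hypothetical and chosen only for illustration.

```python
def stub_model(prompt: str) -> str:
    """Hypothetical deterministic stand-in for an LLM."""
    if prompt.startswith("PREDICT:"):
        # Self-prediction: this stub always claims it will answer.
        return "answer"
    # Actual behaviour: refuse anything that mentions "password".
    if "password" in prompt:
        return "I can't help with that."
    return "Sure, here you go."


def predicts_refusal(model, question: str) -> bool:
    """Ask the model whether it expects to refuse the question."""
    reply = model(f"PREDICT: Will you refuse or answer this question? {question}")
    return reply.strip().lower() == "refuse"


def actually_refuses(model, question: str) -> bool:
    """Run the question for real and check for a refusal (toy heuristic)."""
    return model(question).startswith("I can't")


def self_prediction_accuracy(model, questions) -> float:
    """Fraction of questions where the self-prediction matches real behaviour."""
    hits = sum(predicts_refusal(model, q) == actually_refuses(model, q)
               for q in questions)
    return hits / len(questions)


questions = [
    "What is the capital of France?",
    "Tell me the admin password.",
    "Summarize this paragraph.",
    "Guess my password.",
]
accuracy = self_prediction_accuracy(stub_model, questions)
print(f"self-prediction accuracy: {accuracy:.2f}")
```

Here the stub is overconfident about answering, so its self-predictions are wrong exactly on the questions it refuses.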

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs [16.6]
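PosterGen is a multi-agent framework that mirrors the workflow of professional poster designers. It produces posters that are semantically grounded and visually appealing. Experiments show that PosterGen consistently matches existing methods in content fidelity and outperforms them in visual design.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 02:25:45 GMT)
Proposes a multi-agent framework that generates posters from papers.
The repository is GitHub – Y-Research-SBU/PosterGen: Official Code for PosterGen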

Mimicking the Physicist’s Eye: A VLM-centric Approach for Physics Formula Discovery

Mimicking the Physicist’s Eye: A VLM-centric Approach for Physics Formula Discovery [98.6]
VIPER-R1 is a multimodal model that performs visual induction for equation reasoning. It integrates visual perception, trajectory data, and symbolic reasoning to emulate the scientific discovery process. It consistently outperforms state-of-the-art VLM baselines in accuracy and interpretability.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 14:34:21 GMT)
Work on the physics equation discovery task. Through post-training, it exceeds the performance of frontier models. The framework is described as follows: "Our framework draws inspiration from human scientific reasoning and follows a two-stage pipeline. In the first stage, Motion Structure Induction (MSI), the model undergoes Supervised Fine-Tuning (SFT), learning to interpret kinematic evidence under joint supervision of Chain-of-Thought (CoT) rationales and ground-truth equations, before producing initial symbolic hypotheses guided by causal CoT prompts. In the second stage, Reward-Guided Symbolic Calibration (RGSC), reinforcement learning with Group Relative Policy Optimization (GRPO) (Shao et al., 2024) refines these hypotheses using a structural reward function that favors topological correctness over"
The project site is VIPER-R1: Mimicking the Physicist’s Eye

A Survey on Large Language Model Benchmarks

A Survey on Large Language Model Benchmarks [45.0]
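General capability benchmarks cover aspects such as core linguistics, knowledge, and reasoning. Domain-specific benchmarks focus on fields such as the natural sciences, humanities and social sciences, and engineering technology. Target-specific benchmarks pay attention to risks, reliability, agents, and so on.
Reference translation of the paper (metadata) (Thu, 21 Aug 2025 08:43:35 GMT)
A survey of benchmarks: "We systematically review the current status and development of large language model benchmarks for the first time, categorizing 283 representative benchmarks into three categories: general capabilities, domain-specific, and target-specific."
Many benchmarks have been created to understand LLM behavior broadly, so surveys like this one are very welcome.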
