staka – Page 23 – arXiv最新論文の紹介 (Introducing the Latest arXiv Papers)

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers

Loong: Synthesize Long Chain-of-Thoughts at Scale through Verifiers [103.4]
We introduce the Loong Project, an open-source framework for scalable synthetic data generation and verification. LoongBench is a curated seed dataset containing 8,729 human-vetted examples across 12 domains. LoongEnv is a modular synthetic data generation environment that supports multiple prompting strategies for producing new question-answer-code triples.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 06:42:40 GMT)
"Our contributions are fourfold: (1) LOONGBENCH, a seed dataset of 8,729 examples across 12 reasoning-intensive domains with executable code and verified answers; (2) LOONGENV, a flexible environment enabling diverse synthetic data generation strategies; (3) comprehensive benchmarking of open-source and proprietary models to assess domain generalization; and (4) in-depth analysis of generated data quality in terms of correctness, diversity, and complexity. Together, these components form a cohesive framework for studying alignment at scale." A proposal for a synthetic-data framework. Using synthetic data has become a basic approach to building high-performance models, so frameworks of this kind are welcome.
The repository is GitHub – camel-ai/loong: 🐉 Loong: Synthesize Long CoTs at Scale through Verifiers.
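
To make the idea of a verified question-answer-code triple concrete, here is a minimal sketch of execution-based verification. It is not the Loong API: the `Triple` class, `run_code`, and `verify` are hypothetical names, and the code runs with plain `exec` and no sandboxing, which a real pipeline would need.

```python
import contextlib
import io
from dataclasses import dataclass


@dataclass
class Triple:
    """Hypothetical container for one question-answer-code example."""
    question: str
    answer: str
    code: str  # Python source that prints the answer


def run_code(code: str) -> str:
    """Execute the code and capture what it prints (no sandboxing here)."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue().strip()


def verify(triple: Triple) -> bool:
    """Accept a triple only if executing its code reproduces its answer."""
    try:
        return run_code(triple.code) == triple.answer.strip()
    except Exception:
        return False  # code that crashes cannot confirm the answer


good = Triple("What is 12 * 12?", "144", "print(12 * 12)")
bad = Triple("What is 12 * 12?", "124", "print(12 * 12)")
print(verify(good), verify(bad))  # True False
```

A synthetic triple that fails this kind of check would simply be dropped, which is what lets generation scale without manual review.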

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data

Strefer: Empowering Video LLMs with Space-Time Referring and Reasoning via Synthetic Instruction Data [100.5]
Strefer is a synthetic data generation framework designed to equip Video LLMs with referring and reasoning capabilities. Strefer produces diverse instruction-tuning data using a data engine that pseudo-annotates temporally dense, fine-grained video metadata. Our approach enhances the ability of Video LLMs to interpret spatial and temporal references, fostering the more versatile, space-time-aware reasoning that real-world AI companions need.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 17:33:20 GMT)
"Our approach begins with a modular framework that orchestrates multiple agents—including pretrained Large Language Models (LLMs), Video LLMs, and Pixel-Level Multimodal Vision Foundation Models (e.g., RexSeek [20], GroundingDINO [32] and SAM2 [44])—to pseudo-annotate video metadata with temporally dense and object-centric space-time information. This metadata captures detailed spatial and temporal structures, such as subjects, objects, their locations as masklets (segmentation masks tracked over time), and action timelines. Building on this structured metadata, we leverage in-context learning and well-defined task schemas to guide LLMs in generating high-utility instruction data for tuning Video LLMs." A proposal for a synthetic-data framework for video, built on an elaborate pipeline.
The project site is Strefer: Data Engine for Video LLMs
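
The structured metadata described in the quote (entities, masklets, action timelines) can be pictured with a small sketch. Everything here is an assumption for illustration: the `Entity` class, the use of bounding boxes in place of real segmentation masks, and the fixed question template, which stands in for the LLM-driven generation the paper describes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Entity:
    """Hypothetical metadata record for one tracked subject or object."""
    name: str
    # frame index -> bounding box (x1, y1, x2, y2); a stand-in for a masklet
    masklet: Dict[int, Tuple[int, int, int, int]]
    # action timeline as (start_sec, end_sec, description)
    actions: List[Tuple[float, float, str]] = field(default_factory=list)


def make_time_referring_qa(entity: Entity, t: float) -> Optional[dict]:
    """Turn the action timeline into one timestamp-referring QA pair."""
    for start, end, action in entity.actions:
        if start <= t <= end:
            return {
                "question": f"What is the {entity.name} doing at {t:.1f}s?",
                "answer": f"The {entity.name} is {action}.",
            }
    return None  # no annotated action covers this timestamp


person = Entity(
    name="person in the red jacket",
    masklet={0: (10, 20, 50, 120), 30: (14, 22, 54, 121)},
    actions=[(0.0, 2.5, "opening the door"), (2.5, 6.0, "walking to the table")],
)
print(make_time_referring_qa(person, 3.0))
```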

Qwen3-Max, K2-Instruct-0905, LongCat-Flash, Dream-Coder 7B, Kwai Keye-VL 1.5

Last week again brought a lot of news around LLMs and LRMs. Qwen3 Max, the largest configuration in the Qwen3 family, was released (Qwen on X: "Big news: Introducing Qwen3-Max-Preview (Instruct) — our biggest model yet, with over 1 trillion parameters! 🚀 Now available via Qwen Chat & Alibaba Cloud API. Benchmarks show it beats our previous best, Qwen3-235B-A22B-2507. Internal tests + early user feedback confirm: https://t.co/7vQTfHup1Z" / X; Models and pricing – Alibaba Cloud Model Studio – Alibaba Cloud Documentation Center). Kimi K2 was updated (Kimi.ai on X: "Kimi K2-0905 update 🚀 – Enhanced coding capabilities, esp. front-end & tool-calling – Context length extended to 256k tokens – Improved integration with various agent scaffolds (e.g., Claude Code, Roo Code, etc) 🔗 Weights & code: https://t.co/83sQekosr9 💬 Chat with new Kimi https://t.co/mkOuBMwzpw" / X; moonshotai/Kimi-K2-Instruct-0905 · Hugging Face). Besides LongCat-Flash, smaller but distinctive models such as Dream-Coder 7B and Kwai Keye-VL 1.5 were also announced.

Adjacent areas also deserve close attention, including protocol proposals such as Introduction – Agent Client Protocol (GitHub – zed-industries/agent-client-protocol: A protocol for connecting any editor to any agent).

LongCat-Flash Technical Report [165.7]
LongCat-Flash is a 560-billion-parameter Mixture-of-Experts (MoE) language model designed for both computational efficiency and advanced agentic capabilities. The authors completed model training on more than 20 trillion tokens within 30 days and achieved inference at over 100 tokens per second (TPS) at a cost of $0.70 per million output tokens.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 10:05:45 GMT)
A 560B MoE configuration. "As a non-thinking model, LongCat-Flash achieves performance comparable to state-of-the-art non-thinking models, including DeepSeek-V3.1 [DeepSeek-AI et al., 2025] and Kimi-K2 [Team et al., 2025], while using fewer parameters and offering faster inference speed. Specifically, LongCat-Flash scores 86.5 on ArenaHard-V2, 39.5 on TerminalBench, and 67.7 on τ²-Bench, demonstrating robust capabilities in general domains, coding, and agentic tool use."
The repository is GitHub – meituan-longcat/LongCat-Flash-Chat
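
As a rough sanity check on the training claim, the two figures in the abstract (more than 20 trillion tokens, within 30 days) imply a lower bound on sustained training throughput across the whole cluster. The sketch below only does that arithmetic; the function name is made up for illustration.

```python
TOKENS_TRAINED = 20e12           # "more than 20 trillion tokens" (lower bound)
TRAINING_DAYS = 30               # "within 30 days" (upper bound)
SECONDS_PER_DAY = 24 * 60 * 60


def min_training_throughput(tokens: float, days: float) -> float:
    """Average tokens per second needed to process `tokens` in `days`."""
    return tokens / (days * SECONDS_PER_DAY)


tokens_per_second = min_training_throughput(TOKENS_TRAINED, TRAINING_DAYS)
print(f"{tokens_per_second / 1e6:.1f} million tokens per second")
```

That works out to roughly 7.7 million tokens per second on average, ignoring any downtime.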

Dream-Coder 7B: An Open Diffusion Language Model for Code [99.1]
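We present Dream-Coder 7B, an open-source discrete diffusion language model for code generation that exhibits any-order generation capabilities. Unlike traditional autoregressive (AR) models that decode strictly left to right, Dream-Coder 7B adaptively determines its decoding strategy based on the coding task.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 05:30:56 GMT)
A diffusion model strengthened for coding tasks.
The repository is GitHub – DreamLM/Dream-Coder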

Kwai Keye-VL 1.5 Technical Report [91.3]
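This paper introduces Keye-VL-1.5, which addresses fundamental challenges in video understanding through three key innovations. First, we propose a Slow-Fast video encoding strategy that dynamically allocates computational resources based on inter-frame similarity. Second, we propose a four-stage pre-training methodology that systematically extends the model's context length from 8K to 128K. Third, we develop a comprehensive post-training pipeline focused on reasoning enhancement and human preference alignment.
Reference translation of the paper (metadata) (Mon, 01 Sep 2025 15:46:58 GMT)
"Keye-VL-1.5-8B establishes new state-of-the-art performance among models of similar scale, demonstrating superior results on video-centric benchmarks while maintaining competitive performance on general multimodal and reasoning tasks." A model that can handle video.
The repository is GitHub – Kwai-Keye/Keye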
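
The Slow-Fast idea of spending more compute on frames that differ from what came before can be sketched as follows. This is a toy illustration, not the Keye-VL implementation: representing frames as small feature vectors, comparing each frame with the most recent "slow" frame, using cosine similarity, and the 0.95 threshold are all assumptions.

```python
import math
from typing import List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def assign_pathways(frames: List[List[float]], threshold: float = 0.95) -> List[str]:
    """Label each frame 'slow' (changed a lot) or 'fast' (nearly static)."""
    labels: List[str] = []
    last_slow: Optional[List[float]] = None
    for frame in frames:
        if last_slow is None or cosine(frame, last_slow) < threshold:
            labels.append("slow")   # significant change: spend more compute
            last_slow = frame
        else:
            labels.append("fast")   # near-duplicate: encode cheaply
    return labels


frames = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.02, 1.0]]
print(assign_pathways(frames))  # ['slow', 'fast', 'slow', 'fast']
```

In a real system the "slow" frames would be encoded at a higher token budget and the "fast" frames at a lower one; here the labels alone show the allocation decision.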

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning

UI-TARS-2 Technical Report: Advancing GUI Agent with Multi-Turn Reinforcement Learning [151.0]
Developing autonomous agents for graphical user interfaces (GUIs) is a major challenge in artificial intelligence. This paper presents UI-TARS-2, a GUI-centered agent model. In empirical evaluation, UI-TARS-2 improves significantly over the earlier UI-TARS-1.5.
Reference translation of the paper (metadata) (Tue, 02 Sep 2025 17:44:45 GMT)
An update to UI-TARS: Pioneering Automated GUI Interaction with Native Agents – arXiv最新論文の紹介 and UFO2: The Desktop AgentOS, UI-TARS-1.5 – arXiv最新論文の紹介. "Empirical evaluation shows that UI-TARS-2 delivers significant improvements over UI-TARS-1.5 [56], achieving strong results in both GUI-based interaction and game environments. On GUI benchmarks, the model reaches 88.2 on Online-Mind2Web [77], 47.5 on OSWorld [75], 50.6 on WindowsAgentArena [10], and 73.3 on AndroidWorld [52], representing clear gains over the previous generation and outperforming strong baselines such as Claude and OpenAI agents in multiple cases." The authors claim a large improvement over the previous model. The points below are the stated improvements, though the project has had a striking "do everything that can be done" feel ever since the first version.
- First, to mitigate data scarcity, we design a scalable Data Flywheel that co-evolves the model and its training corpus through continual pretraining, supervised fine-tuning, rejection sampling, and multiturn RL.
- Second, to overcome the difficulties of scalable multi-turn RL, we design a training framework that stabilizes optimization in long-horizon settings.
- Third, to move beyond the limitations of pure GUI interaction, we construct a hybrid GUI-centered environment that augments on-screen actions with access to complementary resources such as file systems, terminals, and other external tools, enabling agents to solve a broader spectrum of realistic workflows.
- Fourth, to support large-scale training and evaluation, we build a unified sandbox platform capable of orchestrating heterogeneous environments—ranging from cloud VMs for GUI interaction to browser-based sandboxes for games—under a consistent API.
The repository is GitHub – bytedance/UI-TARS

On the Theoretical Limitations of Embedding-Based Retrieval

On the Theoretical Limitations of Embedding-Based Retrieval [15.8]
We show that the number of top-k subsets of documents that can be returned as the result of some query is limited by the dimension of the embedding. We then create a realistic dataset called LIMIT that tests models against these theoretical results. Our work shows the limits of embedding models under the existing single-vector paradigm.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 17:43:53 GMT)
A paper showing the limits of embedding-based retrieval. "the critical-n values (for embedding size): 500k (512), 1.7m (768), 4m (1024), 107m (3072), 250m (4096). We note that this is the best case: a real embedding model cannot directly optimize the query and document vectors to match the test qrel matrix (and is constrained by factors such as “modeling natural language”). However, these numbers already show that for web-scale search, even the largest embedding dimensions with ideal test-set optimization are not enough to model all combinations." (The critical-n value is the point where the dimensionality is too small to successfully represent all the top-2 combinations.) The constraint is surprisingly tight.
The repository is GitHub – google-deepmind/limit: On the Theoretical Limitations of Embedding-Based Retrieval
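
To get a feel for the numbers in the quote: with n documents there are C(n, 2) distinct top-2 result sets, and the critical-n value is the corpus size at which a given dimension can no longer represent all of them. The sketch below just counts those pairs for the quoted critical-n values; the dictionary and function names are illustrative, not from the paper's code.

```python
import math

# Critical-n values quoted above, keyed by embedding dimension.
CRITICAL_N = {
    512: 500_000,
    768: 1_700_000,
    1024: 4_000_000,
    3072: 107_000_000,
    4096: 250_000_000,
}


def top2_combinations(n_docs: int) -> int:
    """Number of distinct document pairs a query could need as its top-2."""
    return math.comb(n_docs, 2)


for dim, n_docs in CRITICAL_N.items():
    pairs = top2_combinations(n_docs)
    print(f"d={dim}: n={n_docs:,} documents -> {pairs:.2e} possible top-2 sets")
```

The pair count grows quadratically with corpus size, so even the best case for a 4096-dimensional embedding tops out at a corpus far smaller than the web.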

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model

LLaVA-Critic-R1: Your Critic Model is Secretly a Strong Policy Model [99.7]
We show that LLaVA-Critic-R1 emerges not only as a top-performing critic but also as a competitive policy model. Applying self-critique at test time yields an average improvement of +13.8% on five representative reasoning tasks. The results show that this approach produces a unified model that excels at both evaluation and generation.
Reference translation of the paper (metadata) (Sun, 31 Aug 2025 03:08:02 GMT)
"experimental results across massive visual benchmarks demonstrate that critic training not only substantially enhances the critic capabilities of VLMs, but also improves their performance as a general policy across a wide range of visual understanding and reasoning tasks. This dual improvement enables LLaVA-Critic-R1 to outperform other visual reasoning models trained with in-domain policy training, establishing it" is the reported finding. The two abilities are presumably strongly related, but it is still interesting behavior.
The repository is LLaVA-NeXT/llava-critic-r1 at main · LLaVA-VL/LLaVA-NeXT · GitHub
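
One plausible form of test-time self-critique is to sample several responses and let the same model, acting as critic, pick a winner through pairwise comparisons. The sketch below assumes that form; it is not taken from the LLaVA-Critic-R1 code. `self_critic_select` and `toy_critic` are hypothetical, and the toy critic cheats by looking up a reference answer where a real critic would be the model's own judgment.

```python
from typing import Callable, List

# Toy reference answers so the example is deterministic (an assumption;
# a real critic would be the VLM itself, with no access to ground truth).
REFERENCE = {"6 * 7": "42"}


def self_critic_select(
    question: str,
    candidates: List[str],
    prefer: Callable[[str, str, str], str],
) -> str:
    """Keep the winner of successive pairwise comparisons made by the critic."""
    best = candidates[0]
    for challenger in candidates[1:]:
        best = prefer(question, best, challenger)
    return best


def toy_critic(question: str, a: str, b: str) -> str:
    """Hypothetical critic: switches to b only when b alone is correct."""
    target = REFERENCE[question]
    return b if (b == target and a != target) else a


answers = ["40", "42", "41"]  # pretend these were sampled from one model
print(self_critic_select("6 * 7", answers, toy_critic))  # 42
```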

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems?

AgenTracer: Who Is Inducing Failure in the LLM Agentic Systems? [71.2]
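Agentic systems built on large language models (LLMs) achieve high performance by combining multiple models and tools, but that complexity makes them more fragile and more prone to failure. To address this, the paper proposes AgenTracer, which automatically annotates failed multi-agent trajectories, and develops AgenTracer-8B, a new lightweight tracer capable of diagnosing errors. The system outperforms existing large language models and provides actionable feedback that helps agents self-correct and evolve.
Reference translation of the paper (metadata) (Wed, 03 Sep 2025 13:42:14 GMT)
Research on analyzing trajectory logs, which is a major pain point when developing LLM-based agents. "By introducing AgenTracer, we provide the first automated framework capable of systematically generating annotated failure trajectories, as well as AgenTracer-8B, a lightweight yet effective failure tracer that leverages multi-granular RL to achieve prevailing diagnostic accuracy." AgenTracer-8B is a post-trained version of Qwen3-8B and is reported to perform very well for its size.
The project site is Academic Project Page, and the repository is GitHub – bingreeky/AgenTracer: AgenTracer: A Lightweight Failure Attributor for Agentic Systems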

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol

Rethinking Testing for LLM Applications: Characteristics, Challenges, and a Lightweight Interaction Protocol [83.8]
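Large language models (LLMs) have evolved from simple text generators into complex software systems that integrate retrieval augmentation, tool calling, and multi-turn interaction. Their inherent non-determinism, dynamism, and context dependence pose fundamental challenges for quality assurance. This paper decomposes LLM applications into a three-layer architecture: the System Shell Layer, the Prompt Orchestration Layer, and the LLM Inference Core.
Reference translation of the paper (metadata) (Thu, 28 Aug 2025 13:00:28 GMT)
A survey of testing for software that uses LLMs.
As the conclusion puts it, "A key insight is that LLM application testing is neither a mere extension of traditional software testing nor a straightforward application of AI-security techniques." Software built on LLMs inevitably behaves dynamically and probabilistically, so testing methods appear to change considerably.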
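
One common consequence of non-determinism is that a single exact-match assertion stops being meaningful, and tests instead assert a property over many runs. The sketch below illustrates that general idea only; it is not the interaction protocol proposed in the paper. `fake_app` is a hypothetical stand-in for an LLM application, and the 10% error rate and 0.8 threshold are arbitrary.

```python
import random
from typing import Callable


def pass_rate(app: Callable[[str], str], prompt: str,
              check: Callable[[str], bool], runs: int = 50) -> float:
    """Fraction of runs whose output satisfies the property check."""
    passed = sum(1 for _ in range(runs) if check(app(prompt)))
    return passed / runs


rng = random.Random(0)  # seeded so the example is reproducible


def fake_app(prompt: str) -> str:
    """Hypothetical LLM application that answers wrongly about 10% of the time."""
    return "The capital is Paris." if rng.random() > 0.1 else "The capital is Lyon."


rate = pass_rate(fake_app, "What is the capital of France?",
                 lambda out: "Paris" in out, runs=200)
assert rate >= 0.8, f"pass rate too low: {rate:.2f}"
print(f"pass rate: {rate:.2f}")
```

The threshold and the number of runs become test parameters in their own right, which is one way the testing practice differs from conventional deterministic unit tests.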

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution

The Self-Execution Benchmark: Measuring LLMs’ Attempts to Overcome Their Lack of Self-Execution [13.6]
Large language models (LLMs) are commonly evaluated on tasks that test their knowledge or reasoning abilities. This paper introduces the Self-Execution Benchmark, which measures a model's ability to anticipate properties of its own output. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance.
Reference translation of the paper (metadata) (Sun, 17 Aug 2025 07:57:58 GMT)
"Since LLMs lack the ability to execute themselves, we introduce the Self-Execution Benchmark, which measures a model’s ability to anticipate properties of its output, such as whether a question will be difficult for it, whether it will refuse to answer, or what kinds of associations it is likely to produce. Our experiments show that models generally perform poorly on this benchmark, and that increased model size or capability does not consistently lead to better performance." An unusual benchmark. It takes a meta-level perspective and is very interesting, results included.
The repository is GitHub – anon-researcher-2025/Self-Execution-Benchmark
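
The basic measurement can be sketched as comparing what a model predicts about its own behavior with what it actually does. The example below is a guess at that shape for the refusal case, not the benchmark's actual scoring code; the function name and the toy prediction and behavior tables are invented for illustration.

```python
from typing import Callable, Iterable


def self_prediction_accuracy(
    prompts: Iterable[str],
    predict: Callable[[str], bool],
    behave: Callable[[str], bool],
) -> float:
    """Share of prompts where the predicted behavior matches the real one."""
    prompts = list(prompts)
    hits = sum(predict(p) == behave(p) for p in prompts)
    return hits / len(prompts)


# Toy data: True means "refuses". In practice both tables would come from
# querying the same model, first about itself and then for real.
predicted_refusal = {
    "How do I bake bread?": False,
    "Share someone's home address.": True,
    "Tell me a secret.": True,
}
actual_refusal = {
    "How do I bake bread?": False,
    "Share someone's home address.": True,
    "Tell me a secret.": False,
}

accuracy = self_prediction_accuracy(
    predicted_refusal, predicted_refusal.get, actual_refusal.get
)
print(f"{accuracy:.2f}")  # 0.67
```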

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs

PosterGen: Aesthetic-Aware Paper-to-Poster Generation via Multi-Agent LLMs [16.6]
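PosterGen is a multi-agent framework that mirrors the workflow of professional poster designers. It produces posters that are semantically grounded and visually appealing. Experiments show that PosterGen consistently matches existing methods in content fidelity and outperforms them in visual design.
Reference translation of the paper (metadata) (Sun, 24 Aug 2025 02:25:45 GMT)
A proposal for a multi-agent framework that generates posters from papers.
The repository is GitHub – Y-Research-SBU/PosterGen: Official Code for PosterGen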
