Introducing the latest arXiv papers

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs

Reasoning on a Budget: A Survey of Adaptive and Controllable Test-Time Compute in LLMs [45.8]
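Large language models (LLMs) have rapidly progressed into general-purpose agents capable of solving a broad range of tasks. However, they apply fixed inference-time compute regardless of task complexity, often overthinking simple problems while underthinking hard ones. This survey presents a comprehensive review of efficient test-time compute strategies, which aim to improve the computational efficiency of LLM reasoning.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 18:27:42 GMT)
A survey: "This survey presents a comprehensive review of efficient test-time compute (TTC) strategies, which aim to improve the computational efficiency of LLM reasoning. We introduce a two-tiered taxonomy that distinguishes between L1 controllability—methods that operate under fixed compute budgets—and L2 adaptiveness—methods that dynamically scale inference based on input difficulty or model confidence."
Hybrid approaches are also popular in commercial models, and I suspect this is an area where developers are struggling in various ways.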
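To make the survey's two tiers concrete, here is a minimal sketch of how a token budget might be chosen under each. The function names, the linear interpolation rule, and the default token limits are illustrative assumptions, not methods taken from the survey.

```python
def fixed_budget(max_tokens: int) -> int:
    """L1-style controllability: the caller fixes the compute budget up front."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be positive")
    return max_tokens


def adaptive_budget(confidence: float,
                    min_tokens: int = 64,
                    max_tokens: int = 2048) -> int:
    """L2-style adaptiveness: spend more tokens when confidence is low.

    The linear rule below is a hypothetical choice for illustration only.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be in [0, 1]")
    # confidence 1.0 -> min_tokens, confidence 0.0 -> max_tokens
    return round(max_tokens - confidence * (max_tokens - min_tokens))
```

In practice the confidence signal would come from the model itself (for example, from answer agreement across samples); here it is simply passed in as a number.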

Predicting thinking time in Reasoning models [42.6]
Reasoning models produce long, hidden chains of thought. Users have little insight into how much time the model will spend reasoning before it returns an answer.
Reference translation of the paper (metadata) (Sun, 29 Jun 2025 15:01:01 GMT)
A report on predicting thinking time in large reasoning models (LRMs).
"In this paper, we explore methods for online prediction of thinking time in reasoning models. Our experiments demonstrate that current models encode a notion of progress in their internal representations, with an mlp probe achieving 45% accuracy over 10 classes, moreover the errors appear highly local (MAE 1)."
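The two figures quoted above (accuracy over 10 classes and MAE) can be read as follows, assuming the classes are ordered progress bins numbered 0 to 9 and that MAE is measured in bins. That reading, and the helper name below, are assumptions for illustration; the paper's exact evaluation protocol is not given here.

```python
from typing import Sequence, Tuple


def accuracy_and_mae(true_bins: Sequence[int],
                     pred_bins: Sequence[int]) -> Tuple[float, float]:
    """Return (exact-match accuracy, mean absolute error in bins)."""
    if len(true_bins) != len(pred_bins) or len(true_bins) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(true_bins)
    # Accuracy counts only predictions that hit the exact bin.
    accuracy = sum(t == p for t, p in zip(true_bins, pred_bins)) / n
    # MAE measures how far off the misses are, in units of bins.
    mae = sum(abs(t - p) for t, p in zip(true_bins, pred_bins)) / n
    return accuracy, mae


# Made-up example values: two exact hits, one miss by 1 bin, one miss by 2 bins.
acc, mae = accuracy_and_mae([0, 3, 5, 9], [0, 4, 5, 7])
```

A modest exact-match accuracy combined with a low MAE means most misses land in a neighbouring bin, which is what "highly local" errors suggests.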

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots

VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots [45.0]
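In this work, we propose an architecture that automatically verifies task plans before they are executed in a simulator or real-world environment. The module uses the reasoning capabilities of Large Language Models to evaluate logical coherence and identify potential gaps in the plan. We contribute to improving the reliability and efficiency of task planning and address the need for robust pre-execution verification in autonomous systems.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 15:31:36 GMT)
To verify task plans, the paper proposes an approach that also uses a formal language: "In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences."
The repository is VerifyLLM: LLM-Based Pre-Execution Task Plan Verification for Robots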
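As a rough illustration of what checking an action sequence against a temporal constraint can look like, the sketch below evaluates two common LTL-style patterns on a finite plan. It is not the VerifyLLM implementation: the function names, the action names, and the finite-trace reading of the operators are all assumptions made for this example.

```python
from typing import Sequence


def always_eventually_after(trace: Sequence[str],
                            trigger: str, response: str) -> bool:
    """Finite-trace reading of G(trigger -> F response)."""
    for i, action in enumerate(trace):
        # Every occurrence of the trigger needs a response at or after it.
        if action == trigger and response not in trace[i:]:
            return False
    return True


def precedes(trace: Sequence[str], first: str, second: str) -> bool:
    """`second` never occurs before the first occurrence of `first`."""
    for action in trace:
        if action == first:
            return True
        if action == second:
            return False
    return True  # neither action occurs, so the constraint holds vacuously


# Hypothetical plans for a household robot.
good_plan = ["open_fridge", "pick_milk", "close_fridge"]
bad_plan = ["pick_milk", "open_fridge"]
```

Here `good_plan` satisfies both constraints (the fridge is opened before the milk is picked and is eventually closed), while `bad_plan` violates both.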

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset

Seamless Interaction: Dyadic Audiovisual Motion Modeling and Large-Scale Dataset [112.5]
We introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage. This dataset enables the development of AI technologies that understand dyadic embodied dynamics. Using this dataset, we developed a suite of models that generate motion gestures and facial expressions aligned with human speech.
Reference translation of the paper (metadata) (Fri, 27 Jun 2025 18:09:49 GMT)
A dataset: "we introduce the Seamless Interaction Dataset, a large-scale collection of over 4,000 hours of face-to-face interaction footage from over 4,000 participants in diverse contexts. This dataset enables the development of AI technologies that understand dyadic embodied dynamics, unlocking breakthroughs in virtual agents, telepresence experiences, and multimodal content analysis tools."
The repository is GitHub – facebookresearch/seamless_interaction: Foundation Models and Data for Human-Human and Human-AI interactions.

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents

VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents [105.4]
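VLM2Vec-V2 is a unified framework for learning embeddings across diverse visual forms. First, we introduce MMEB-V2, a comprehensive benchmark that extends MMEB with five new task types. Next, we train VLM2Vec-V2, a general-purpose embedding model that supports text, image, video, and visual document inputs.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 00:51:57 GMT)
A proposal of "MMEB-V2, an advanced multimodal embedding dataset designed to train and evaluate embedding models across three key visual modalities: images, videos, and visual documents." together with VLM2Vec-V2, an embedding model built on it. A fairly general-purpose 2vec.
The project site is VLM2Vec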

ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series

ReTimeCausal: EM-Augmented Additive Noise Models for Interpretable Causal Discovery in Irregular Time Series [32.2]
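This paper studies causal discovery in irregularly sampled time series in high-stakes domains such as finance, healthcare, and climate science. ReTimeCausal is a novel integration of Additive Noise Models (ANM) and Expectation-Maximization (EM) that unifies physics-guided data imputation with sparse causal inference.
Reference translation of the paper (metadata) (Fri, 04 Jul 2025 05:39:50 GMT)
A report on causal discovery for irregularly sampled time-series data. The authors write: "we propose ReTimeCausal (Recovery for Irregular Time-series Causal Discovery). ReTimeCausal integrates Additive Noise Models (ANMs) with an Expectation-Maximization (EM) framework to jointly perform noise-aware data imputation and causal structure learning."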

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents

A Survey on Autonomy-Induced Security Risks in Large Model-Based Agents [45.5]
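Recent advances in large language models (LLMs) have catalyzed the rise of autonomous AI agents. These large-model agents mark a paradigm shift from static inference systems to interactive, memory-augmented entities.
Reference translation of the paper (metadata) (Mon, 30 Jun 2025 13:34:34 GMT)
A survey on AI agents and security risks.
There are many points to consider.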

Scaling RL to Long Videos

Scaling RL to Long Videos [107.4]
LongVILA-R1-7B achieves strong performance on long video QA benchmarks such as VideoMME. LongVILA-R1 marks a first step toward long video reasoning in vision-language models. The authors release a publicly available training system that supports RL training on various modalities.
Reference translation of the paper (metadata) (Thu, 10 Jul 2025 17:47:40 GMT)
A proposal of a framework for understanding long videos, built on "(1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling."
As the authors note, "Unlike domains such as math or code reasoning, where structured supervision and benchmarks are readily available [7, 8], long video reasoning requires annotating complex temporal dynamics, goals, spatial relations, and narrative elements—often across minutes or hours of footage", so the difficulty differs from that of code generation or mathematical reasoning.
The repository is GitHub – NVlabs/Long-RL: Long-RL: Scaling RL to Long Sequences

AI4Research: A Survey of Artificial Intelligence for Scientific Research

AI4Research: A Survey of Artificial Intelligence for Scientific Research [55.5]
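We present a comprehensive survey of AI for Research (AI4Research). First, we introduce a systematic taxonomy that classifies five mainstream tasks in AI4Research. We then identify key research gaps and highlight promising future directions.
Reference translation of the paper (metadata) (Wed, 02 Jul 2025 17:19:20 GMT)
A survey on applying AI to research. It treats the following as the main tasks.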
- (1) AI for Scientific Comprehension
- (2) AI for Academic Surveys
- (3) AI for Scientific Discovery
- (4) AI for Academic Writing
- (5) AI for Academic Reviewing
The project site is AI4Research: A Survey of Artificial Intelligence for Scientific Research

CritiQ: Mining Data Quality Criteria from Human Preferences

CritiQ: Mining Data Quality Criteria from Human Preferences [70.4]
We introduce CritiQ, a novel data selection method that automatically mines data quality criteria from human preferences. CritiQ Flow uses a manager agent to evolve the quality criteria, while worker agents make pairwise judgments. We demonstrate the effectiveness of the method in the code, math, and logic domains.
Reference translation of the paper (metadata) (Mon, 07 Jul 2025 09:58:59 GMT)
A proposal of a data selection method (making annotation more efficient) that starts from a very small amount of data: "We introduce CritiQ, a novel data selection method that automatically mines criteria from human preferences for data quality with only ∼30 human-annotated pairs and performs efficient data selection. The main component, CritiQ Flow, employs a manager agent to evolve quality criteria and worker agents to make pairwise judgments."
The repository is GitHub – KYLN24/CritiQ: Repository of the paper "CritiQ: Mining Data Quality Criteria from Human Preferences". Code for CritiQ Flow & Training CritiQ Scorer.

GTA1: GUI Test-time Scaling Agent

GTA1: GUI Test-time Scaling Agent [77.6]
This paper investigates two challenges with the GUI Test-time Scaling Agent, GTA1. First, it proposes a test-time scaling method for selecting the most appropriate action proposal. Second, it proposes a model that grounds the selected action proposal to the corresponding visual element more accurately.
Reference translation of the paper (metadata) (Tue, 08 Jul 2025 08:52:18 GMT)
A GUI agent proposed by Salesforce Research, claiming SoTA on OSWorld and other benchmarks.
The authors apply the following techniques: "i) test-time scaling for planning, which introduces a scaling strategy during inference to effectively handle planning ambiguity in complex GUI environments; ii) grounding model training, filtering out training samples with annotation errors to improve supervision quality, and optimizing a grounding model using RL (e.g., GRPO) to directly predict coordinates without relying on any intermediate "thinking" (i.e., CoT reasoning) on the derived data." They post-train UI-TARS-1.5-7B, Qwen2.5-VL-32B-Instruct, and Qwen2.5-VL-72B-Instruct; I wonder whether this task is simply too hard without this kind of tuning...
The repository is GitHub – Yan98/GTA1
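Selecting the most appropriate action proposal at test time can be sketched as a best-of-N loop: sample several candidate actions, score each, and keep the highest-scoring one. This is a generic illustration, not the GTA1 code. The `propose` and `judge` callables, the action strings, and the scores are hypothetical stand-ins for a planner model and a scoring model.

```python
from typing import Callable, List, Sequence


def propose_actions(propose: Callable[[], str], n: int) -> List[str]:
    """Sample n candidate actions from a (hypothetical) planner callable."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return [propose() for _ in range(n)]


def select_action(proposals: Sequence[str],
                  judge: Callable[[str], float]) -> str:
    """Return the proposal with the highest judge score (first one on ties)."""
    if not proposals:
        raise ValueError("no proposals to choose from")
    return max(proposals, key=judge)


# Stand-ins for model calls: a fixed stream of proposals and a score table.
_stream = iter(["click(Cancel)", "click(Save)", "type('report.txt')"])
_scores = {"click(Cancel)": 0.1, "click(Save)": 0.8, "type('report.txt')": 0.4}

proposals = propose_actions(lambda: next(_stream), n=3)
best = select_action(proposals, lambda action: _scores[action])
```

Raising `n` trades extra inference compute for a better chance that at least one good proposal is sampled, which is the basic idea behind scaling at test time.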
