October 2025 – Introducing the Latest arXiv Papers

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents

VAGEN: Reinforcing World Model Reasoning for Multi-Turn VLM Agents [130.7]
Compared with language model (LLM) agents, a key challenge in training vision-language model (VLM) agents is the shift from textual states to complex visual observations. Can VLM agents build internal world models through explicit visual state reasoning? We architecturally enforce and reward the agent's reasoning process through reinforcement learning (RL). We find that the agent's reasoning about state estimation and transition modeling is essential for success.
Reference translation of the paper (metadata) (Sun, 19 Oct 2025 16:05:07 GMT)
Motivated by the question “How can we effectively teach VLMs to build internal world models through explicit visual state reasoning?” and the observation that “Vision-language Model (VLM) agentic tasks are inherently complex due to the challenges in understanding visual states, which often are partial and noisy Observations, fundamentally reframing the problem from an Markov Decision Process (MDP) to a more challenging Partially Observable Markov Decision Process (POMDP).”, the paper proposes a framework for promoting world-model construction. “To optimize an agent’s world model reasoning, we propose turn-level WorldModeling Reward for a dense turn-level reward to evaluate the accuracy of the agent’s internal state simulation against ground-truth; to solve the critical challenge of long-horizon credit assignment, we propose Bi-Level GAE to first computes the value of an entire turn’s reasoning before propagating that credit precisely to the individual tokens. Our VAGEN framework significantly enhances task performance and visual reasoning quality for VLM in agentic tasks.”
The project site is VAGEN – VLM Agent Training.
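To make the "turn first, then tokens" idea in the quote concrete, here is a minimal sketch of one plausible reading of a bi-level advantage computation. It is not the authors' implementation: the function names, the default discount factors, and the choice to place each turn's advantage as a reward on the turn's final token are all assumptions made for illustration.

```python
from typing import List, Tuple


def gae(rewards: List[float], values: List[float],
        gamma: float, lam: float) -> List[float]:
    """Standard GAE; the value after the final step is taken to be 0."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages


def bi_level_gae(turn_rewards: List[float],
                 turn_values: List[float],
                 token_values_per_turn: List[List[float]],
                 gamma_turn: float = 0.95, lam_turn: float = 0.95,
                 gamma_tok: float = 1.0, lam_tok: float = 1.0
                 ) -> Tuple[List[float], List[List[float]]]:
    # Level 1: one advantage per turn, computed over turn-level rewards.
    turn_adv = gae(turn_rewards, turn_values, gamma_turn, lam_turn)

    # Level 2 (assumption): treat the turn advantage as a reward on the
    # last token of the turn and propagate it backward over the tokens.
    token_adv: List[List[float]] = []
    for adv, tok_values in zip(turn_adv, token_values_per_turn):
        if not tok_values:
            token_adv.append([])
            continue
        tok_rewards = [0.0] * len(tok_values)
        tok_rewards[-1] = adv
        token_adv.append(gae(tok_rewards, tok_values, gamma_tok, lam_tok))
    return turn_adv, token_adv
```

With zero value estimates and all discount factors set to 1, every token in a turn simply inherits that turn's advantage, which is a convenient sanity check for the sketch.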

A Survey on Parallel Reasoning

A Survey on Parallel Reasoning [58.7]
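The survey first gives a formal definition of parallel reasoning and clarifies how it differs from related concepts such as Chain-of-Thought. It then organizes and discusses advanced techniques under a new taxonomy that covers non-interactive reasoning, interactive reasoning, and efficiency-focused decoding strategies. It highlights the core challenges of parallel reasoning and suggests directions for future research.
Reference translation of the paper (metadata) (Tue, 14 Oct 2025 05:42:19 GMT)
A survey of parallel reasoning, built on the premise that “The key idea of parallel reasoning is to make multiple attempts in parallel before answering a question and then aggregate the proposed solutions.”
The repository is GitHub – PPPP-kaqiu/Awesome-Parallel-Reasoning: Awesome-Parallel-Reasoning: Unlocking the reasoning potential of LLMs. Papers, Code, Resources & Survey.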
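The "multiple attempts, then aggregate" idea in the quote can be sketched in a few lines. The example below uses majority voting as the aggregator, which is only one of many possible strategies; `parallel_reason` and `toy_solver` are hypothetical names, and `toy_solver` is a deterministic stand-in for what would be a sampled LLM call in practice.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def parallel_reason(question: str,
                    solver: Callable[[str, int], str],
                    n_attempts: int = 5) -> str:
    """Run several independent attempts in parallel and majority-vote."""
    with ThreadPoolExecutor(max_workers=n_attempts) as pool:
        answers = list(pool.map(lambda seed: solver(question, seed),
                                range(n_attempts)))
    # Aggregate: return the most frequent answer among the attempts.
    return Counter(answers).most_common(1)[0][0]


def toy_solver(question: str, seed: int) -> str:
    # Hypothetical stand-in for an LLM call: wrong on every fourth seed.
    return "41" if seed % 4 == 0 else "42"


if __name__ == "__main__":
    print(parallel_reason("What is 6 * 7?", toy_solver))
```

Swapping `Counter` for a verifier or a learned ranker would change only the aggregation step, leaving the parallel sampling untouched.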

GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain-0: A World Model-Powered Vision-Language-Action Model [44.1]
We introduce GigaBrain-0, a new VLA foundation model empowered by world-model-generated data. GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. We also present GigaBrain-0-Small, a lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.
Reference translation of the paper (metadata) (Wed, 22 Oct 2025 09:57:13 GMT)
“we presented GigaBrain-0, a vision-language-action model that leverages data generated by world models to overcome the scalability and diversity limitations of real-world robot data collection.” A foundation model intended for use in robotics.
The project site is GigaBrain-0: A World Model-Powered Vision-Language-Action Model.

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM [128.4]
We introduce OmniVinci, an initiative to build a strong, open-source omni-modal LLM. On the model architecture side, we present three key innovations: (i) OmniAlignNet, which strengthens the alignment between vision and audio embeddings; (ii) Temporal Embedding Grouping, which captures the temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding, which encodes absolute temporal information in omni-modal embeddings.
Reference translation of the paper (metadata) (Fri, 17 Oct 2025 17:59:59 GMT)
“we introduce a new framework to harmonize vision and audio embeddings in a unified omni-modal embedding space, featuring three new techniques: (i) OmniAlignNet that learns to construct a modality-shared space to align vision and audio embeddings from the same video; (ii) Temporal Embedding Grouping that divides the time dimension into multiple chunks and reorganizes the vision and audio embeddings according to their timestamps to align with the corresponding chunks; (iii) Constrained Rotary Time Embedding to directly insert periodic temporal information into vision-audio embeddings.” A proposal for a multimodal LLM.
The project site is OmniVinci: Joint Visual-Audio Understanding.
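As a rough illustration of the chunking step described for Temporal Embedding Grouping, the sketch below buckets timestamped vision and audio items into equal-length time chunks. It is not OmniVinci's code: the function name, the use of string payloads in place of embedding vectors, and the "vision first, then audio" ordering inside each chunk are assumptions.

```python
from typing import List, Tuple

Stamped = Tuple[float, str]  # (timestamp in seconds, payload)


def group_by_time(vision: List[Stamped], audio: List[Stamped],
                  duration: float, n_chunks: int) -> List[List[str]]:
    """Bucket both modalities into n_chunks equal-length time chunks."""
    chunk_len = duration / n_chunks

    def chunk_index(ts: float) -> int:
        # Clamp so a timestamp equal to `duration` lands in the last chunk.
        return min(int(ts / chunk_len), n_chunks - 1)

    chunks: List[List[str]] = [[] for _ in range(n_chunks)]
    # Assumed ordering inside a chunk: vision items, then audio items,
    # each sorted by timestamp.
    for modality in (vision, audio):
        for ts, payload in sorted(modality):
            chunks[chunk_index(ts)].append(payload)
    return chunks
```

For a 4-second clip split into 2 chunks, vision items at 0.5 s and 1.5 s end up next to an audio item at 1.0 s, so each chunk holds the content that co-occurs in time.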

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents

When Agents Trade: Live Multi-Market Trading Benchmark for LLM Agents [74.6]
Agent Market Arena (AMA) is the first real-time benchmark for evaluating trading agents based on LLMs (Large Language Models). AMA integrates verified trading data, expert-checked news, and diverse agent architectures within a unified trading framework. The evaluation spans GPT-4o, GPT-4.1, Claude-3.5-haiku, Claude-sonnet-4, and Gemini-2.0-flash.
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 17:54:09 GMT)
An environment for evaluating trading agents.
The project site is FinAI.

ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models

ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models [111.3]
We introduce ShapeX, an innovative framework that splits time series into meaningful shapelet-driven segments. At the core of ShapeX is the Shapelet Describe-and-Detect framework, which effectively learns a diverse set of shapelets essential for classification.
Reference translation of the paper (metadata) (Thu, 23 Oct 2025 00:01:40 GMT)
An explanation method for time series classification: “we introduce SHAPEX, a novel approach that segments the time series into meaningful subsequences and computes Shapley value [13] as saliency scores. Instead of distributing importance across individual timesteps, SHAPEX aggregates timesteps into cohesive, shapelet-driven segments that serve as “players” in the Shapley value computation. By measuring each segment’s marginal contribution to the black-box model’s prediction, this method clearly identifies which subsequences significantly influence classification outcomes.”
The repository is GitHub – BosonHwang/ShapeX: Shapelet-Driven Post Hoc Explanations for Time Series Classification Models
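The "segments as players" idea from the quote can be illustrated with an exact Shapley computation over a handful of segments. This is a sketch under assumptions, not the ShapeX code: the segments here are fixed by hand rather than learned by the Shapelet Describe-and-Detect framework, absent segments are replaced by a constant baseline, and `segment_shapley` is a hypothetical name. Exact enumeration is exponential in the number of segments, so it only suits small cases.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Tuple


def segment_shapley(series: List[float],
                    segments: List[Tuple[int, int]],
                    model: Callable[[List[float]], float],
                    baseline: float = 0.0) -> List[float]:
    """Exact Shapley value of each [start, end) segment for model(series)."""
    n = len(segments)

    def masked_score(keep: Tuple[int, ...]) -> float:
        # Keep only the chosen segments; everything else is the baseline.
        masked = [baseline] * len(series)
        for i in keep:
            start, end = segments[i]
            masked[start:end] = series[start:end]
        return model(masked)

    values = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        phi = 0.0
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                # Marginal contribution of segment i to this coalition.
                phi += weight * (masked_score(subset + (i,))
                                 - masked_score(subset))
        values.append(phi)
    return values
```

With an additive toy model such as `sum`, each segment's Shapley value equals the sum of its own timesteps, which makes the sketch easy to check by hand.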

Outraged AI: Large language models prioritise emotion over cost in fairness enforcement

Outraged AI: Large language models prioritise emotion over cost in fairness enforcement [13.5]
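We show that large language models (LLMs) use emotion to guide punishment. Unfairness elicited stronger negative emotion, which in turn led to more punishment. Future models should integrate emotion with context-sensitive reasoning to achieve human-like emotional intelligence.
Reference translation of the paper (metadata) (Fri, 17 Oct 2025 08:41:36 GMT)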
An analysis of LLMs using the third-party punishment (TPP) game. There are many interesting results, such as “This emotion–behaviour coupling was robust and even stronger than humans across reasoning models (o3-mini, DeepSeek-R1) and an advanced foundation model (DeepSeek-V3), with the older GPT-3.5 baseline showing a weaker and less consistent effect. Analyses of the model's rationales further corroborated that elicited emotions were invoked before punitive choices (e.g., references to anger in DeepSeek-R1), consistent with emotion-guided decision processes.” and “reasoning LLMs reported stronger affect to unfairness, and prioritised emotion over fairness and cost, whereas humans weighted fairness and cost more heavily [75]. These dissociations indicate that current LLMs have not fully internalised the human-like cost–benefit calculus that tempers norm enforcement.” When applying LLMs/LRMs to advanced domains, I think careful evaluation that takes the differences from humans as a basic premise is probably needed.
As I also felt with “Corrupted by Reasoning: Reasoning Language Models Become Free-Riders in Public Goods Games, How large language models judge and influence human cooperation – Introducing the Latest arXiv Papers”, this kind of research is very interesting.

Fundamentals of Building Autonomous LLM Agents

Fundamentals of Building Autonomous LLM Agents [64.4]
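This paper reviews the architecture and implementation methods of agents built on large language models (LLMs). The work aims to explore patterns for developing “agentic” LLMs that can automate complex tasks and close the performance gap with human capabilities.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 10:32:39 GMT)
“This paper is based on a seminar technical report from the course Trends in Autonomous Agents: Advances in Architecture and Practice offered at TUM.” Textbook-style content on building agents.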

World-in-World: World Models in a Closed-Loop World

World-in-World: World Models in a Closed-Loop World [123.9]
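We introduce World-in-World, the first open platform that benchmarks world models (WMs) in a closed-loop world mirroring real agent-environment interaction. We curate four closed-loop environments that rigorously evaluate diverse WMs, prioritize task success as the primary metric, and move beyond the usual focus on visual quality. The study reveals three surprises: 1) visual quality alone does not guarantee task success, and controllability matters more; 2) scaling post-training with action-observation data is more effective than upgrading the pretrained video generator; and 3) allocating more inference-time compute lets WMs substantially improve closed-loop performance.
Reference translation of the paper (metadata) (Mon, 20 Oct 2025 22:09:15 GMT)
A benchmark for visual generation models used as world models. It points out a gap between visual quality and quality as a world model.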
- We introduce World-in-World, the first comprehensive closed-loop benchmark that evaluates world models through the lens of embodied interaction, moving beyond the common focus on generation quality.
- We propose a unified closed-loop planning strategy with a unified action API, allowing diverse world models to be seamlessly integrated and evaluated within a single framework across four embodied tasks.
- We discover that high visual quality does not necessarily guarantee task success, and demonstrate how the performance of pretrained video generators can be substantially improved through training-time data scaling and inference-time scaling.
The project site is World-in-World: World Models in a Closed-Loop World.
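For readers unfamiliar with what "closed-loop" means here, the toy sketch below shows the general pattern: at each step the agent asks a world model to predict the outcome of each candidate action, executes the best-looking one in the real environment, observes the true result, and plans again. It does not reproduce the benchmark's unified action API; `closed_loop_plan`, the integer state, and the distance-to-goal score are hypothetical simplifications.

```python
from typing import Callable, List


def closed_loop_plan(state: int, goal: int, actions: List[int],
                     world_model: Callable[[int, int], int],
                     env_step: Callable[[int, int], int],
                     max_steps: int = 10) -> List[int]:
    """Plan with a world model, act in the environment, then re-plan."""
    trajectory = [state]
    for _ in range(max_steps):
        if state == goal:
            break
        # Imagine each candidate action and keep the one whose predicted
        # next state is closest to the goal.
        best = min(actions, key=lambda a: abs(goal - world_model(state, a)))
        # Execute in the real environment; the observed state (not the
        # predicted one) feeds the next planning round.
        state = env_step(state, best)
        trajectory.append(state)
    return trajectory


if __name__ == "__main__":
    model = lambda s, a: s + a   # toy world model: perfect prediction
    env = lambda s, a: s + a     # toy environment: same dynamics
    print(closed_loop_plan(0, 3, [-1, 1, 2], model, env))
```

Task success in this loop depends on how well the model's predictions support action selection, not on how good its outputs look, which is the distinction the benchmark emphasizes.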

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition

LM Fight Arena: Benchmarking Large Multimodal Models via Game Competition [104.8]
This paper introduces LM Fight Arena, a new framework for evaluating large multimodal models in Mortal Kombat II. Unlike static evaluations, LM Fight Arena is fully automated and reproducible, and provides an objective assessment of an LMM's strategic reasoning capabilities.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 02:19:21 GMT)
“Unlike static evaluations, LM Fight Arena provides a fully automated, reproducible, and objective assessment of an LMM’s strategic reasoning capabilities in a dynamic setting. This work introduces a challenging and engaging benchmark that bridges the gap between AI evaluation and interactive entertainment.” So the paper says, but why Mortal Kombat...
Apparently Claude 3.5 Sonnet is very strong.
