arXiv – Page 14 – Introducing the Latest arXiv Papers

UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs

UniLION: Towards Unified Autonomous Driving Model with Linear Group RNNs [115.9]
UniLION efficiently processes large-scale LiDAR point clouds, high-resolution multi-view images, and even temporal sequences. It consistently delivers competitive and even state-of-the-art performance across a wide range of core tasks.
Reference translation of the paper (metadata) (Mon, 03 Nov 2025 17:24:19 GMT)
A proposal for an RNN-based multimodal model: "We propose UniLION, a unified model that achieves both latent temporal fusion and multimodal fusion in UniLION backbone by the linear group RNN, generating the unified BEV features that serve all autonomous driving tasks, including perception, prediction, and planning." It is thoroughly multimodal: "Unified Heterogeneous Inputs: Leveraging the superior long-range modeling capability and linear computational complexity of linear group RNNs, UniLION integrates multi-view images, LiDAR point clouds, and temporal information into a unified 3D backbone through direct token concatenation, eliminating hand-crafted fusion modules and providing a more elegant, scalable solution."
The repository is GitHub – happinesslz/UniLION
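The "direct token concatenation" idea quoted above can be illustrated with a toy sketch. This is not UniLION's implementation: the token values, the `concat_tokens` and `linear_scan` helpers, and the scalar-decay recurrence are all assumptions made for illustration, standing in for the paper's linear group RNN. The point is only that tokens from every modality go into one sequence that is processed in a single pass whose cost grows linearly with the number of tokens.

```python
from typing import List

Token = List[float]  # one feature vector per token


def concat_tokens(*modalities: List[Token]) -> List[Token]:
    """Join per-modality token lists into a single sequence (no fusion module)."""
    sequence: List[Token] = []
    for tokens in modalities:
        sequence.extend(tokens)
    return sequence


def linear_scan(sequence: List[Token], decay: float = 0.5) -> List[Token]:
    """Toy linear recurrence h_t = decay * h_{t-1} + x_t, one pass over the tokens.

    This is a stand-in for a linear RNN, not the operator used in the paper.
    """
    if not sequence:
        return []
    state = [0.0] * len(sequence[0])
    outputs: List[Token] = []
    for x in sequence:
        state = [decay * h + v for h, v in zip(state, x)]
        outputs.append(list(state))
    return outputs


# Hypothetical features; real tokens would come from image and LiDAR encoders.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
lidar_tokens = [[2.0, 2.0]]
temporal_tokens = [[4.0, 0.0]]

fused_sequence = concat_tokens(image_tokens, lidar_tokens, temporal_tokens)
fused_features = linear_scan(fused_sequence)
```

Because the recurrence visits each token once, adding another modality or another past frame only lengthens the sequence rather than requiring a new fusion component.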

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents [46.3]
This paper introduces the OpenHands Software Agent SDK, a toolkit for implementing software development agents. For flexibility, it provides a simple interface that requires only a few lines of code to implement an agent in the default case. For security and reliability, it offers seamless local-to-remote execution portability and integrated REST/WebSocket services.
Reference translation of the paper (metadata) (Wed, 05 Nov 2025 18:16:44 GMT)
The OpenHands paper. "Unlike prior library-only SDKs (Anthropic, 2025a; OpenAI, 2024), OpenHands includes a built-in REST/WebSocket server for remote execution and a suite of interactive workspace interfaces—a browser-based VSCode IDE, VNC desktop, and persistent Chromium browser—for human inspection and control." It also stands out as an integrated environment.
The repository is GitHub – OpenHands/software-agent-sdk: A clean, modular SDK for building AI agents with OpenHands V1.

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks [108.2]
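The paper provides a comprehensive review of multimodal spatial reasoning tasks with large models. It reviews advances in embodied AI, including vision-language navigation and action models. It also considers emerging modalities such as audio and egocentric video, which contribute to spatial understanding through new sensors.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 17:55:43 GMT)
A survey of MLLMs.
The repository is GitHub – zhengxuJosh/Awesome-Multimodal-Spatial-Reasoning: This repository collects and organises state‑of‑the‑art papers on spatial reasoning for Multimodal Vision–Language Models (MVLMs).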

Leveraging LLM-based agents for social science research: insights from citation network simulations

Leveraging LLM-based agents for social science research: insights from citation network simulations [132.4]
The paper introduces the CiteAgent framework, which generates citation networks based on human-behavior simulation. CiteAgent captures the predominant phenomena of real-world citation networks. It establishes two LLM-based research paradigms in social science, making it possible to validate and challenge existing theories.
Reference translation of the paper (metadata) (Wed, 05 Nov 2025 08:47:04 GMT)
"To further explore the social attributes of LLMs, we introduce the CiteAgent framework, designed to generate citation networks based on human-behavior simulation with LLM-based agents. CiteAgent successfully captures predominant phenomena in real-world citation networks, including power-law distribution, citational distortion, and shrinking diameter." That is the claim, but it seems somewhat doubtful whether this is enough to establish the validity of this kind of LLM-based social simulation.
The repository is GitHub – Ji-Cather/CiteAgent: Official Implementation of CiteAgent Framework

Thought Branches: Interpreting LLM Reasoning Requires Resampling

Thought Branches: Interpreting LLM Reasoning Requires Resampling [11.0]
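The paper argues that studying a single sample is insufficient for understanding causal influence and the underlying computation. It presents case studies that use resampling to investigate model decisions.
Reference translation of the paper (metadata) (Fri, 31 Oct 2025 14:02:37 GMT)
A report on intervening in CoT and the resulting effects: "we can measure a partial CoT’s impact by resampling only the subsequent text. We present case studies using resampling to investigate model decisions. First, when a model states a reason for its action, does that reason actually cause the action?" It is an interesting behavioral analysis, including its treatment of prior work. The report also pays attention to the distribution over CoT trajectories, for example "We address this by repeatedly resampling to remove sentences and by measuring resilience, the number of interventions required to erase a sentence’s content from a trace.", and uses a method that is computationally expensive but convincing.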
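The resampling idea can be sketched as follows. This is a minimal illustration, not the paper's procedure or metric: `toy_sampler` is a hypothetical stand-in for continuing a partial CoT with a real model, and the total variation distance used in `sentence_effect` is an assumed way to compare the two answer distributions. The sketch continues the same prefix many times with and without one sentence and compares the resulting distributions instead of trusting a single sample.

```python
import random
from collections import Counter
from typing import Callable, Dict, List

# A sampler continues a partial chain of thought and returns a final answer.
Sampler = Callable[[List[str], random.Random], str]


def answer_distribution(prefix: List[str], sampler: Sampler,
                        n_samples: int, rng: random.Random) -> Dict[str, float]:
    """Estimate the distribution over final answers by resampling continuations."""
    counts = Counter(sampler(prefix, rng) for _ in range(n_samples))
    return {answer: count / n_samples for answer, count in counts.items()}


def sentence_effect(sentences: List[str], index: int, sampler: Sampler,
                    n_samples: int = 200, seed: int = 0) -> float:
    """Total variation distance between answer distributions for the prefix
    that ends with sentence `index` and the prefix that stops just before it."""
    rng = random.Random(seed)
    with_sentence = answer_distribution(sentences[: index + 1], sampler, n_samples, rng)
    without_sentence = answer_distribution(sentences[:index], sampler, n_samples, rng)
    answers = set(with_sentence) | set(without_sentence)
    return 0.5 * sum(abs(with_sentence.get(a, 0.0) - without_sentence.get(a, 0.0))
                     for a in answers)


def toy_sampler(prefix: List[str], rng: random.Random) -> str:
    """Hypothetical model: one sentence strongly shifts the final decision."""
    p_rush = 0.9 if "The deadline is tomorrow." in prefix else 0.2
    return "rush" if rng.random() < p_rush else "wait"


trace = ["Let me review the task.", "The deadline is tomorrow."]
causal_effect = sentence_effect(trace, 1, toy_sampler)
filler_effect = sentence_effect(trace, 0, toy_sampler)
```

In this toy setup the second sentence changes the answer distribution substantially while the first one barely moves it, which is the kind of contrast a single sampled trace cannot reveal.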

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation

OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation [39.3]
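OmniLayout-1M is the first million-scale dataset of document layouts. The paper introduces OmniLayout-LLM, a 0.5B model trained with a two-stage learning paradigm. The code, models, and dataset will be publicly released.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 07:39:54 GMT)
A proposal of the document layout dataset OmniLayout-1M and of OmniLayout-LLM.
According to the authors, "Our code, models, and dataset will be publicly released."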

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning

MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning [73.3]
This paper proposes MemSearcher, an agentic workflow that iteratively maintains a memory and combines it with the current turn. At each turn, MemSearcher fuses the user's question with the memory, generates a reasoning trace, performs a search action, and updates the memory so that it retains only the information needed to solve the task. The paper also introduces multi-context GRPO, an end-to-end RL framework that jointly optimizes the reasoning, search strategy, and memory management of MemSearcher agents.
Reference translation of the paper (metadata) (Tue, 04 Nov 2025 18:27:39 GMT)
"We introduce MemSearcher, an agentic workflow that leverages the backbone LLM as a memory manager to iteratively maintain a compact memory, preserving only the essential information necessary for answering the user’s question and thereby eliminating the need to append the entire interaction history to the LLM context. • We develop search agents based on MemSearcher, and utilize multi-context GRPO, a natural extension of GRPO, to optimize LLMs to reason, leverage search engines and manage memory simultaneously." A proposal for a model trained with reinforcement learning to handle memory-related functions well. The effect is confirmed: "MemSearcher based on Qwen2.5-3B-Instruct achieves a higher average score than other methods based on Qwen2.5-7B-Instruct."
The repository is GitHub – icip-cas/MemSearcher
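The per-turn workflow described above (fuse question and memory, reason, search, update memory) can be sketched as a simple loop. This is an illustrative sketch only, not the MemSearcher code: `run_agent`, the three callables, and the toy corpus are hypothetical, and in the paper the backbone LLM plays all of these roles. What the sketch shows is that each turn's context is built from the question and a compact memory rather than from the full interaction history.

```python
from typing import Callable, List, Optional, Tuple

Policy = Callable[[str], Tuple[str, str]]            # context -> (action, argument)
Search = Callable[[str], List[str]]                  # query -> retrieved passages
MemoryUpdate = Callable[[str, str, List[str]], str]  # (question, memory, passages) -> memory


def run_agent(question: str, policy: Policy, search: Search,
              update_memory: MemoryUpdate,
              max_turns: int = 4) -> Tuple[Optional[str], List[int]]:
    """Run the loop; return the answer and the context length seen at each turn."""
    memory = ""
    context_lengths: List[int] = []
    for _ in range(max_turns):
        # The context holds only the question and the compact memory.
        context = f"Question: {question}\nMemory: {memory}"
        context_lengths.append(len(context))
        action, argument = policy(context)
        if action == "answer":
            return argument, context_lengths
        passages = search(argument)
        memory = update_memory(question, memory, passages)
    return None, context_lengths


# Hypothetical stand-ins for the LLM policy, the search engine, and the memory manager.
CORPUS = {"capital of France": ["Paris is the capital of France.", "France uses the euro."]}


def toy_policy(context: str) -> Tuple[str, str]:
    if "Paris" in context:
        return ("answer", "Paris")
    return ("search", "capital of France")


def toy_search(query: str) -> List[str]:
    return CORPUS.get(query, [])


def toy_update_memory(question: str, memory: str, passages: List[str]) -> str:
    # Keep only passages that look relevant, and cap the memory size.
    relevant = [p for p in passages if "capital" in p]
    return " ".join(relevant)[:200]


answer, context_lengths = run_agent(
    "What is the capital of France?", toy_policy, toy_search, toy_update_memory)
```

Capping the memory keeps the context length roughly constant across turns, whereas appending every retrieved passage would make it grow with each search.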

Scaling Agent Learning via Experience Synthesis

Scaling Agent Learning via Experience Synthesis [100.4]
Reinforcement learning (RL) can strengthen large language model (LLM) agents by enabling self-improvement through interaction. The paper introduces DreamGym, the first unified framework designed to synthesize diverse experiences with scalability in mind. Rather than relying on expensive real-environment rollouts, DreamGym distills environment dynamics into a reasoning-based experience model.
Reference translation of the paper (metadata) (Wed, 05 Nov 2025 18:58:48 GMT)
"To synthesize diverse agent experiences for RL training, DreamGym is built around three key components: (1) a scalable reasoning experience model that encodes the meta-dynamics of the target domain to efficiently generate informative trajectories; (2) an experience replay buffer that integrates offline environment knowledge with online synthetic transitions, co-evolving with the agent to stay aligned with its updated policy; (3) a curriculum task generator that produces progressively challenging variations of high-value tasks selected via a reward-entropy heuristic." A powerful synthesis framework.
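One way to read the "reward-entropy heuristic" in the quote is sketched below. The exact formula used by DreamGym is not given here, so this is an assumption: rewards are taken to be binary (0 or 1), each task is scored by the binary entropy of its success rate, and `select_high_value_tasks` is a hypothetical helper. Under that reading, tasks the agent sometimes solves and sometimes fails score highest, and tasks that are always solved or always failed score zero.

```python
import math
from typing import Dict, List


def reward_entropy(rewards: List[int]) -> float:
    """Binary entropy (in bits) of the empirical success rate of 0/1 rewards."""
    if not rewards:
        return 0.0
    p = sum(rewards) / len(rewards)
    if p == 0.0 or p == 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def select_high_value_tasks(history: Dict[str, List[int]], k: int) -> List[str]:
    """Return the k task names with the highest reward entropy (assumed scoring rule)."""
    ranked = sorted(history, key=lambda task: reward_entropy(history[task]), reverse=True)
    return ranked[:k]


# Hypothetical rollout outcomes per task: 1 = success, 0 = failure.
rollout_history = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "coin_flip": [1, 0, 1, 0],
    "mostly_solved": [1, 1, 1, 0],
}

selected = select_high_value_tasks(rollout_history, k=2)
```

A curriculum generator would then produce harder variations of the selected tasks; that step depends on the experience model and is left out of this sketch.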

A Survey on Unlearning in Large Language Models

A Survey on Unlearning in Large Language Models [18.3]
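Large language models (LLMs) have revolutionized natural language processing, but training them on massive corpora poses significant risks. To mitigate these issues and align with legal and ethical standards such as the "right to be forgotten", machine unlearning has emerged as a key technique. This survey provides a systematic review of more than 180 papers on LLM unlearning published since 2021.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 02:34:17 GMT)
A survey of unlearning, which matters for real-world deployment but is far from easy.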

Thinking with Video, V-Thinker

Research on exploiting multimodal data during reasoning is progressing.

Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm [73.5]
The "Thinking with Video" paradigm bridges visual and textual reasoning in a unified temporal framework. The results establish Sora-2 as a capable reasoner on vision-centric tasks. On text-centric tasks, Sora-2 achieves 92% accuracy on MATH and 75.53% on MMMU.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 17:25:23 GMT)
"Moving beyond the traditional paradigms of “Thinking with Text” (e.g., Chain-of-Thought [3, 37]) and “Thinking with Images”, we propose “Thinking with Video”. It naturally enables human-like dynamic reasoning through video generation, such as drawing and imagination." Thinking by means of video.
The project site is Thinking with Video: Video Generation as a Promising Multimodal Reasoning Paradigm, and the repository is GitHub – tongjingqi/Thinking-with-Video: We introduce “Thinking with Video”, a new paradigm leveraging video generation for unified multimodal reasoning. Our VideoThinkBench shows that Sora-2 surpasses GPT5 by 10% on eyeballing puzzles and reaches 75% accuracy on MMMU, positioning video generation as a promising multimodal reasoning paradigm.

V-Thinker: Interactive Thinking with Images [22.6]
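Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for large multimodal models (LMMs). The paper proposes V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios.
Reference translation of the paper (metadata) (Thu, 06 Nov 2025 15:32:29 GMT)
"we introduce V-Thinker, a general-purpose multimodal reasoning assistant that fosters interactive vision-centric thinking via end-to-end reinforcement training." A proposal for an assistant that thinks by making use of vision.
The repository is GitHub – We-Math/V-Thinker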
