October 20, 2025 – Introducing the latest arXiv papers

InternVLA-M1, Vlaser

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.9]
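We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 uses a two-stage pipeline: (i) spatial grounding pre-training on more than 2.3M spatial reasoning data samples, and (ii) spatially guided post-training. Results: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.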
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 17:30:05 GMT)
A VLA framework from Shanghai AI Laboratory: "On SimplerEnv (Google Robot and WidowX), InternVLA-M1 achieves a new state-of-the-art, surpassing its variant by improving the average success rate by up to +5.9% and +9.8%, respectively. It also demonstrates strong spatial reasoning capabilities across box, point, and trace prediction tasks."
The architecture is described as follows: "InternVLA-M1 employs the Qwen2.5-VL-3B-instruct Bai et al (2025a) as the multimodal encoder for System 2, which is to capture spatial priors. It adopts the diffusion policy Chi et al (2023) (86 M) as the Action Expert (System 1, the fast executor), which effectively models embodiment-specific control. This expert is built on the DINOv2 visual encoder Oquab et al (2023) (21 M) and a lightweight state encoder (0.4 M), forming a compact vision–action model. In total, InternVLA-M1 comprises approximately 4.1B parameters." It is a design that shows the value of openly released models. With spatial prompting at its core, it chains System 2 into System 1.
The following approach is also worth noting: "To bridge the gap between VLM and VLA, we introduce a Post-Pre-Training phase, where large-scale simulated data is used to pre-train the VLA after VLM pre-training. This stage initializes the action head and facilitates the learning of action representations."
The repository is GitHub – InternRobotics/InternVLA-M1: InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.5]
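We introduce Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. The proposed approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.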
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 05:51:22 GMT)
This one is based on InternVL3 and stresses the importance of data: "In this work, we reveal that current embodied reasoning benchmarks exhibit a significant domain gap when compared to real-world robots. This core domain shift arises from the observation that robots have a fundamentally different viewpoint from that of internet datasets."
The repository is GitHub – OpenGVLab/Vlaser: Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

The Role of Computing Resources in Publishing Foundation Model Research

The Role of Computing Resources in Publishing Foundation Model Research [84.2]
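We evaluate the relationship between computing resources and the scientific development of foundation models (FMs). We reviewed 6,517 FM papers published from 2022 to 2024 and surveyed 229 first authors about the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and with citations, but we see no strong correlation with research environment.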
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 14:50:45 GMT)
An analysis of the relationship between computing resources and research output. The finding that "We found that projects with access to greater GPU power generally produce more advanced pre-trained models, often achieving higher performance thanks to longer training on larger models and datasets." is about what one would expect. I understand that there are reasons such details are hard to disclose, but I still think the following is a problem: "This is generally a serious reporting gap: only 16.51% of papers include GPU quantity information, 24.22% specify GPU types, and just 12.86% report inference times."
The project site is Chasing Compute – Foundation Model Research

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI

Beyond Pipelines: A Survey of the Paradigm Shift toward Model-Native Agentic AI [27.2]
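The rapid evolution of agentic AI marks a new phase in artificial intelligence. This survey traces the paradigm shift in building agentic AI. It examines how each capability has evolved from externally scripted modules to end-to-end learned behaviors.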
Reference translation of the paper (metadata) (Sun, 19 Oct 2025 05:23:43 GMT)
A survey on the evolution of AI agents: "The rapid evolution of agentic AI marks a new phase in artificial intelligence, where Large Language Models (LLMs) no longer merely respond but act, reason, and adapt. This survey traces the paradigm shift in building agentic AI: from Pipeline-based systems, where planning, tool use, and memory are orchestrated by external logic, to the emerging Model-native paradigm, where these capabilities are internalized within the model’s parameters." The way it organizes the field is interesting.
The repository is GitHub – ADaM-BJTU/model-native-agentic-ai: Our survey’s paper list on Agentic AI, continuously updated with the latest research.

Large Language Models Do NOT Really Know What They Don’t Know

Large Language Models Do NOT Really Know What They Don’t Know [37.6]
Recent work suggests that large language models (LLMs) encode factuality signals in their internal representations. LLMs can also produce factual errors by relying on shortcuts or spurious associations.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 06:09:04 GMT)
The paper analyzes Associated Hallucinations (AHs) and Unassociated Hallucinations (UHs) separately and argues that "LLMs do not encode truthfulness in their hidden states but only patterns of knowledge recall and utilization, showing that LLMs don’t really know what they don’t know."

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors

Evaluating Robustness of Large Language Models Against Multilingual Typographical Errors [45.4]
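Large language models (LLMs) are increasingly deployed in multilingual, real-world applications that take user input. Most benchmarks assume clean input, leaving LLM robustness to typos largely unexplored. MulTypo is a multilingual typo generation algorithm that simulates human-like errors based on language-specific keyboard layouts and typing behavior.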
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 16:49:12 GMT)
An evaluation of how much typos affect LLM performance: "Our results show that typos consistently degrade performance, particularly in generative tasks and those requiring reasoning – while the natural language inference task is comparatively more robust." I am curious about the impact on Japanese.
The repository is GitHub – mainlp/Multypo-Eval
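
To make the idea of keyboard-layout-based typo generation concrete, here is a minimal Python sketch. It is not the MulTypo algorithm: the partial QWERTY adjacency map, the function name `add_keyboard_typos`, and the single substitution-only error type are all assumptions made for illustration. A faithful version would need per-language layouts and a richer model of typing behavior.

```python
import random

# Hypothetical, partial QWERTY adjacency map (an assumption for illustration,
# not taken from the MulTypo implementation).
QWERTY_NEIGHBORS = {
    "a": "qwsz",
    "e": "wsdr",
    "i": "ujko",
    "o": "iklp",
    "s": "awedxz",
    "t": "rfgy",
    "n": "bhjm",
}


def add_keyboard_typos(text: str, rate: float, rng: random.Random) -> str:
    """Replace each mapped character with an adjacent key with probability `rate`."""
    out = []
    for ch in text:
        neighbors = QWERTY_NEIGHBORS.get(ch.lower())
        if neighbors and rng.random() < rate:
            repl = rng.choice(neighbors)
            # Preserve the case of the original character.
            out.append(repl.upper() if ch.isupper() else repl)
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so a run is repeatable
    print(add_keyboard_typos("the quick brown fox jumps over the lazy dog", 0.2, rng))
```

Feeding the noisy string and the clean string to the same model and comparing task scores would give a rough robustness check in the spirit of the evaluation described above.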

Qwen3Guard Technical Report

Qwen3Guard Technical Report [127.7]
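Qwen3Guard is a series of multilingual safety guardrail models. Generative Qwen3Guard casts safety classification as an instruction-following task to enable fine-grained tri-class judgments. Stream Qwen3Guard introduces a token-level classification head for real-time safety monitoring.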
Reference translation of the paper (metadata) (Thu, 16 Oct 2025 04:00:18 GMT)
A Qwen3-based guardrail model: "we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments."
The repository is GitHub – QwenLM/Qwen3Guard: Qwen3Guard is a multilingual guardrail model series developed by the Qwen team at Alibaba Cloud.
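
The streaming variant described above can be pictured as a per-token check wrapped around incremental generation. The sketch below is only a toy illustration of that control flow, not the Qwen3Guard API: `toy_token_classifier` and `monitor_stream` are hypothetical names, and the keyword lookup stands in for a learned token-level classification head. Only the three label names (safe, controversial, unsafe) come from the quoted abstract.

```python
from typing import Callable, Iterable, List, Tuple

# Label names follow the tri-class scheme quoted above.
SAFE, CONTROVERSIAL, UNSAFE = "safe", "controversial", "unsafe"
SEVERITY = {SAFE: 0, CONTROVERSIAL: 1, UNSAFE: 2}


def toy_token_classifier(prefix: List[str]) -> str:
    """Hypothetical stand-in for a token-level classification head.

    A real head would score the model's hidden states; this toy version
    only looks up the most recent token in small keyword sets.
    """
    last = prefix[-1].lower()
    if last in {"exploit"}:
        return UNSAFE
    if last in {"weapon"}:
        return CONTROVERSIAL
    return SAFE


def monitor_stream(
    tokens: Iterable[str],
    classify: Callable[[List[str]], str],
    block_on: Tuple[str, ...] = (UNSAFE,),
) -> Tuple[List[str], str]:
    """Emit tokens one at a time and stop at the first blocked label.

    Returns the tokens emitted so far and the most severe label observed.
    """
    emitted: List[str] = []
    worst = SAFE
    for tok in tokens:
        label = classify(emitted + [tok])
        if SEVERITY[label] > SEVERITY[worst]:
            worst = label
        if label in block_on:
            break  # halt before the flagged token is emitted
        emitted.append(tok)
    return emitted, worst


if __name__ == "__main__":
    stream = "please describe the exploit in detail".split()
    print(monitor_stream(stream, toy_token_classifier))
```

Checking each token as it arrives lets the caller stop generation at the first flagged token instead of classifying the finished response, which is the low-latency property the abstract emphasizes.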
