arXiv – Page 19 – Introducing the Latest arXiv Papers

ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning

ToolLibGen: Scalable Automatic Tool Creation and Aggregation for LLM Reasoning [80.1]
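Large Language Models (LLMs) equipped with external tools have demonstrated improved performance on complex reasoning tasks. However, widespread adoption of this tool-augmented reasoning is hindered by the scarcity of domain-specific tools. We propose a systematic approach for automatically turning an unstructured collection of tools into a structured tool library.
Reference translation of the paper (metadata) (Thu, 09 Oct 2025 04:11:16 GMT)
A framework for organizing the tools that LLMs use. Some approaches now create tools automatically, so organizing them is useful.
The repository is GitHub – SalesforceAIResearch/ToolLibGen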

VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification

VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification [107.8]
We introduce a new framework called VeriCite, designed to rigorously verify evidence and improve answer attribution. Experiments with five open-source LLMs across four datasets show that VeriCite substantially improves citation quality while maintaining answer correctness.
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 13:38:54 GMT)
A proposed framework for improving citation quality in RAG, consisting of "initial answer generation, supporting evidence selection, and final answer refinement".
The repository is GitHub – QianHaosheng/VeriCite: Repo for VeriCite: Towards Reliable Citations in Retrieval-Augmented Generation via Rigorous Verification

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models

Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models [30.3]
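Data contamination poses a significant threat to the reliable evaluation of large language models (LLMs). The problem arises when benchmark samples inadvertently appear in training sets, undermining the validity of reported performance. In this paper, we propose Self-Critique as a dedicated contamination detection method for RL post-training.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 10:58:50 GMT)
Self-Critique for contamination detection is interesting: "The method compares token-level entropy sequences between the initial response and the self-critique response. High similarity in entropy space indicates contamination (policy collapse), while low similarity indicates clean samples."
The repository is GitHub – yongding-tao/RL-Data-Contamination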
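To make the entropy comparison quoted above a little more concrete, here is a minimal sketch. It is not the authors' implementation: the use of cosine similarity, the truncation to the shorter sequence, the `threshold` value, and all function names are assumptions made purely for illustration. The inputs are per-token next-token probability distributions for the initial response and for the self-critique response.

```python
import math
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_sequence(dists: Sequence[Sequence[float]]) -> List[float]:
    """One entropy value per generated token."""
    return [token_entropy(d) for d in dists]


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity over the common prefix (truncation is an assumption)."""
    n = min(len(a), len(b))
    a, b = a[:n], b[:n]
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        # Cosine is undefined for an all-zero entropy sequence; return 0.0 here.
        return 0.0
    return dot / (norm_a * norm_b)


def looks_contaminated(
    initial_dists: Sequence[Sequence[float]],
    critique_dists: Sequence[Sequence[float]],
    threshold: float = 0.9,  # hypothetical cutoff, not taken from the paper
) -> bool:
    """Flag a sample when the two entropy sequences are highly similar."""
    similarity = cosine_similarity(
        entropy_sequence(initial_dists), entropy_sequence(critique_dists)
    )
    return similarity >= threshold
```

In practice the distributions would come from the model's logits at each decoding step, and the cutoff would need to be calibrated on samples known to be clean.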

A Survey of Vibe Coding with Large Language Models

A Survey of Vibe Coding with Large Language Models [93.9]
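Vibe Coding is a development methodology in which developers validate AI-generated implementations through outcome observation. Despite its transformative potential, the effectiveness of this emergent paradigm remains under-explored. This survey provides the first comprehensive and systematic review of Vibe Coding with large language models.
Reference translation of the paper (metadata) (Tue, 14 Oct 2025 11:26:56 GMT)
"a novel development methodology termed 'Vibe Coding' where developers validate AI-generated implementations through outcome observation rather than line-by-line code comprehension." So, a survey of Vibe coding...
The repository is GitHub – YuyaoGe/Awesome-Vibe-Coding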

A survey of regular (?) software engineering was also out.

A Comprehensive Survey on Benchmarks and Solutions in Software Engineering of LLM-Empowered Agentic System [54.9]
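This survey provides the first comprehensive analysis of LLM-empowered software engineering. We analyze more than 150 recent papers and organize them into a comprehensive taxonomy spanning two major dimensions. Our analysis reveals how the field has evolved from simple prompt engineering to complex agentic systems.
Reference translation of the paper (metadata) (Fri, 10 Oct 2025 06:56:50 GMT)
A survey of software engineering + LLM-based agents.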

LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation

LLM-Specific Utility: A New Perspective for Retrieval-Augmented Generation [110.6]
Retrieval-augmented generation (RAG) enhances large language models (LLMs) by incorporating external knowledge. Existing work often treats utility as a generic attribute, ignoring the fact that different LLMs may benefit differently from the same passage.
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 12:57:45 GMT)
The authors state: "(1) We highlight the new perspective of utility for RAG, i.e., LLM-specific utility. (2) We introduce the LLM-specific utility judgment task, propose a benchmarking procedure, and provide a comprehensive empirical analysis of various LLMs and methods. (3) We identify the key direction in achieving more effective LLM-specific utility judgment: known queries should reject all passages, while unknown ones must identify useful ones, which need to be analyzed further." That matches my intuition, and it is a useful reference for sorting out the characteristics of RAG.
The repository is Anonymized Repository – Anonymous GitHub
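The "known queries should reject all passages, while unknown ones must identify useful ones" direction quoted above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's benchmarking procedure: `answers_correctly` is a hypothetical judge callback, and treating a passage as useful when it lets the model answer correctly is my own simplification.

```python
from typing import Callable, List, Optional

# Hypothetical judge: True if a specific LLM answers `question` correctly
# when given `passage` (or with no retrieval at all when `passage` is None).
Judge = Callable[[str, Optional[str]], bool]


def llm_specific_useful_passages(
    question: str, passages: List[str], answers_correctly: Judge
) -> List[str]:
    """Return the passages that are useful for this particular LLM."""
    # Known query: the model is already correct without retrieval,
    # so every passage is rejected.
    if answers_correctly(question, None):
        return []
    # Unknown query: keep only the passages that lead to a correct answer
    # (an assumed, simplified notion of utility).
    return [p for p in passages if answers_correctly(question, p)]


if __name__ == "__main__":
    docs = ["Paris is the capital of France.", "Bananas are yellow."]
    # Stub judge for a model that only succeeds when shown the first passage.
    stub = lambda q, p: p is not None and "Paris" in p
    print(llm_specific_useful_passages("What is the capital of France?", docs, stub))
```

Because the judge is tied to one model, running the same question and passages through a different LLM can return a different set, which is the point of treating utility as LLM-specific.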

Self-Improvement in Multimodal Large Language Models: A Survey

Self-Improvement in Multimodal Large Language Models: A Survey [34.4]
Self-improvement in Large Language Models (LLMs) has efficiently enhanced model capabilities without significantly increasing costs. This survey is the first to provide a comprehensive overview of self-improvement in multimodal LLMs.
Reference translation of the paper (metadata) (Fri, 03 Oct 2025 01:48:26 GMT)
A survey on self-improvement. "We provide a structured overview of the current literature and discuss methods from three perspectives: 1) data collection, 2) data organization, and 3) model optimization, to facilitate the further development of self-improvement in MLLMs. We also include commonly used evaluations and downstream applications."

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning

UALM: Unified Audio Language Model for Understanding, Generation and Reasoning [124.2]
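The Unified Audio Language Model (UALM) aims to unify audio understanding, text-to-audio generation, and multimodal reasoning in a single model. We first present UALM-Gen, a language model that directly predicts audio tokens and is comparable to state-of-the-art diffusion models. UALM-Reason is a multimodal reasoning model that uses both text and audio in its intermediate thinking steps to facilitate complex generation tasks.
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 22:55:01 GMT)
NVIDIA proposes UALM: Unified Audio Language Model, a single model capable of audio understanding, text-to-audio generation, and multimodal reasoning. A demo is available at UALM: Unified Audio Language Model for Understanding, Generation, and Reasoning – NVIDIA ADLR.
The repository is audio-intelligence/UALM at main · NVIDIA/audio-intelligence · GitHub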

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training [55.7]
We introduce a scalable paradigm that generates structured UI states and transitions to synthesize training trajectories at scale. The paradigm integrates a digital world simulator for diverse UI states, a guided rollout process for coherent exploration, and a trajectory wrapper. In experiments on WebArena and AndroidWorld, UI-Simulator rivals or surpasses open-source agents trained on real UIs.
Reference translation of the paper (metadata) (Thu, 16 Oct 2025 17:59:38 GMT)
A proposed data synthesis framework that can be used to build GUI agents: "We introduced UI-Simulator, a scalable trajectory synthesis paradigm that uses LLM-based digital world simulators to synthesize diverse UI trajectories at scale through multi-step simulation, guided rollouts, and final trajectory wrapping."
The repository is GitHub – WadeYin9712/UI-Simulator: Code for 🌍 UI-Simulator: LLMs as Scalable, General-Purpose Simulators For Evolving Digital Agent Training

InternVLA-M1, Vlaser

InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy [138.9]
We introduce InternVLA-M1, a unified framework for spatial grounding and robot control. InternVLA-M1 uses a two-stage pipeline: (i) spatial grounding training on more than 2.3M spatial reasoning data samples, and (ii) spatially guided post-training. Results: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka.
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 17:30:05 GMT)
A VLA framework from Shanghai AI Laboratory: "On SimplerEnv (Google Robot and WidowX), InternVLA-M1 achieves a new state-of-the-art, surpassing its variant by improving the average success rate by up to +5.9% and +9.8%, respectively. It also demonstrates strong spatial reasoning capabilities across box, point, and trace prediction tasks."
The architecture is described as "InternVLA-M1 employs the Qwen2.5-VL-3B-instruct (Bai et al., 2025a) as the multimodal encoder for System 2, which is to capture spatial priors. It adopts the diffusion policy (Chi et al., 2023) (86 M) as the Action Expert (System 1, the fast executor), which effectively models embodiment-specific control. This expert is built on the DINOv2 visual encoder (Oquab et al., 2023) (21 M) and a lightweight state encoder (0.4 M), forming a compact vision–action model. In total, InternVLA-M1 comprises approximately 4.1B parameters." It is a configuration that shows the value of open models, with spatial prompting at the core and a System 2 → System 1 flow.
The approach described as "To bridge the gap between VLM and VLA, we introduce a Post-Pre-Training phase, where large-scale simulated data is used to pre-train the VLA after VLM pre-training. This stage initializes the action head and facilitates the learning of action representations." is also worth noting.
The repository is GitHub – InternRobotics/InternVLA-M1: InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning [124.5]
We introduce Vlaser, a Vision-Language-Action model with synergistic embodied reasoning capability. Vlaser achieves state-of-the-art performance across a range of embodied reasoning benchmarks. The proposed approach achieves state-of-the-art results on the WidowX benchmark and competitive performance on the Google Robot benchmark.
Reference translation of the paper (metadata) (Mon, 13 Oct 2025 05:51:22 GMT)
This one is based on InternVL3 and emphasizes the importance of data: "In this work, we reveal that current embodied reasoning benchmarks exhibit a significant domain gap when compared to real-world robots. This core domain shift arises from the observation that robots have a fundamentally different viewpoint from that of internet datasets."
The repository is GitHub – OpenGVLab/Vlaser: Vlaser: Vision-Language-Action Model with Synergistic Embodied Reasoning

The Role of Computing Resources in Publishing Foundation Model Research

The Role of Computing Resources in Publishing Foundation Model Research [84.2]
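We evaluate the relationship between computing resources and the scientific advancement of foundation models (FM). We reviewed 6,517 FM papers published between 2022 and 2024 and surveyed 229 first authors about the impact of computing resources on scientific output. We find that increased computing is correlated with national funding allocations and citations, but we do not observe a strong correlation with research environment.
Reference translation of the paper (metadata) (Wed, 15 Oct 2025 14:50:45 GMT)
An analysis of the relationship between computing resources and research output. The finding that "We found that projects with access to greater GPU power generally produce more advanced pre-trained models, often achieving higher performance thanks to longer training on larger models and datasets." is about what I would expect. I understand there are circumstances that make disclosure difficult, but I still think "This is generally a serious reporting gap: only 16.51% of papers include GPU quantity information, 24.22% specify GPU types, and just 12.86% report inference times." is a problem.
The project site is Chasing Compute – Foundation Model Research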
