staka – Page 15 – Introducing the Latest arXiv Papers

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities

UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities [53.8]
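UniversalRAG is a novel RAG framework for retrieving and integrating knowledge from heterogeneous sources with diverse modalities and granularities. The paper proposes a modality-aware routing mechanism that dynamically identifies the most appropriate modality-specific corpus and performs targeted retrieval within it. UniversalRAG is validated on eight benchmarks spanning multiple modalities.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 13:18:58 GMT)
To support multimodal RAG, the approach is: "UniversalRAG dynamically determines the most suitable knowledge source to retrieve from, based on the modality requirement of the given query, then routes the retrieval process to the corresponding modality-specific corpus." Two kinds of router are tried, a "Training-free Router" (GPT-4o in the experiments) and a "Trained Router" (DistilBERT and T5-Large in the experiments), and on average the Trained Router appears to have the edge.
The project site is UniversalRAG: Retrieval-Augmented Generation over Multiple Corpora with Diverse Modalities and Granularities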
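As a rough illustration of the route-then-retrieve idea quoted above, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the corpus names ("text", "image", "video", "none"), the keyword-based `keyword_router` (a toy stand-in for the GPT-4o or DistilBERT/T5 routers used in the paper), and the word-overlap retriever. It is not the authors' implementation.

```python
from typing import Callable, Dict, List, Tuple

# A retriever takes (query, k) and returns up to k documents.
Retriever = Callable[[str, int], List[str]]


def keyword_router(query: str) -> str:
    """Toy router: picks a corpus name from surface keywords (hypothetical rule)."""
    q = query.lower()
    if any(w in q for w in ("video", "clip", "scene")):
        return "video"
    if any(w in q for w in ("image", "photo", "picture", "look like")):
        return "image"
    if any(w in q for w in ("who", "when", "where", "what", "why", "how")):
        return "text"
    return "none"  # the query is answered without retrieval


def make_overlap_retriever(docs: List[str]) -> Retriever:
    """Toy retriever: ranks documents by word overlap with the query."""
    def retrieve(query: str, k: int) -> List[str]:
        q_words = set(query.lower().split())
        ranked = sorted(docs, key=lambda d: -len(q_words & set(d.lower().split())))
        return ranked[:k]
    return retrieve


def route_and_retrieve(
    query: str,
    router: Callable[[str], str],
    retrievers: Dict[str, Retriever],
    k: int = 1,
) -> Tuple[str, List[str]]:
    """Route the query to one corpus, then retrieve only inside that corpus."""
    corpus = router(query)
    if corpus not in retrievers:
        return corpus, []
    return corpus, retrievers[corpus](query, k)


retrievers = {
    "text": make_overlap_retriever(
        ["Paris is the capital of France", "Tokyo is the capital of Japan"]
    ),
    "image": make_overlap_retriever(["photo of the Eiffel Tower at night"]),
}
print(route_and_retrieve("What is the capital of France", keyword_router, retrievers))
```

Swapping `keyword_router` for a trained classifier changes only the routing step; the retrieval call stays the same, which is the main appeal of the design.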

Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning

Nemotron-Research-Tool-N1: Tool-Using Language Models with Reinforced Reasoning [93.3]
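The authors developed a series of tool-using language models trained with a paradigm similar to that of DeepSeek-R1. Nemotron-Research-Tool-N1 is optimized with a binary reward that evaluates only the structural validity and functional correctness of tool calls. Experiments show that Nemotron-Research-Tool-N1-7B and Nemotron-Research-Tool-N1-14B, built on Qwen-2.5-7B/14B-Instruct, achieve state-of-the-art results.
Reference translation of the paper (metadata) (Fri, 25 Apr 2025 02:55:21 GMT)
A report confirming the effectiveness of rule-based reinforcement learning: "We introduces Nemotron-Research-Tool-N1, a series of tool-using language models trained with a rule-based reinforcement learning."
The repository is GitHub – NVlabs/Tool-N1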
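To make the "binary reward" idea concrete, the following Python sketch scores a model output as 1.0 only when it is both well-formed and matches the expected tool calls, and 0.0 otherwise. The output format (a JSON list inside `<tool_call>` tags), the function names, and the exact matching rule are assumptions for illustration; the paper's actual format and checks may differ.

```python
import json
import re
from typing import Any, Dict, List, Optional

# Assumed output format: a JSON list of calls wrapped in <tool_call> tags.
TOOL_CALL_RE = re.compile(r"<tool_call>(.*?)</tool_call>", re.DOTALL)


def parse_tool_calls(output: str) -> Optional[List[Dict[str, Any]]]:
    """Structural check: return the parsed calls, or None if malformed."""
    match = TOOL_CALL_RE.search(output)
    if match is None:
        return None
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list):
        return None
    for call in calls:
        if not isinstance(call, dict):
            return None
        if not isinstance(call.get("name"), str):
            return None
        if not isinstance(call.get("arguments"), dict):
            return None
    return calls


def _normalize(calls: List[Dict[str, Any]]) -> List[str]:
    """Order-insensitive canonical form of a list of tool calls."""
    return sorted(
        json.dumps({"name": c["name"], "arguments": c["arguments"]}, sort_keys=True)
        for c in calls
    )


def binary_reward(output: str, expected: List[Dict[str, Any]]) -> float:
    """1.0 only if the output is well-formed AND the calls match exactly."""
    calls = parse_tool_calls(output)
    if calls is None:
        return 0.0
    return 1.0 if _normalize(calls) == _normalize(expected) else 0.0


expected = [{"name": "get_weather", "arguments": {"city": "Tokyo"}}]
good = '<tool_call>[{"name": "get_weather", "arguments": {"city": "Tokyo"}}]</tool_call>'
print(binary_reward(good, expected))
```

Because the reward has no partial credit, a nearly correct call with one wrong argument is scored the same as a malformed output.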

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning [99.6]
Self-Play Critic (SPC) is a new approach in which the ability to evaluate reasoning steps evolves through adversarial self-play games. SPC fine-tunes two copies of a base model to play two roles: a "sneaky generator" and a "critic".
Reference translation of the paper (metadata) (Sun, 27 Apr 2025 08:45:06 GMT)
The proposed approach: "In this paper, we propose a self-play critic with the ability of detecting step-level LLMs reasoning errors. Specifically, we design a sneaky generator to produce incorrect steps and a critic to assess the correctness of each step. Through the adversarial game between these two models, we can continuously generate positive and negative samples for reinforcement learning." It feels rather GAN-like.
The project site is SPC: Evolving Self-Play Critic via Adversarial Games for LLM Reasoning

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment

A Comprehensive Survey in LLM(-Agent) Full Stack Safety: Data, Training and Deployment [291.0]
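This paper introduces the concept of "full-stack" safety, which systematically considers safety issues across the entire process of LLM training, deployment, and commercialization. The study exhaustively reviews more than 800 papers, ensuring comprehensive coverage and a systematic organization of security issues. It identifies promising research directions, including safety in data generation, alignment techniques, model editing, and LLM-based agent systems.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 05:02:49 GMT)
A comprehensive survey on safety.
High hopes for the repository as well: bingreeky/full-stack-llm-safety · GitHub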

DeepCritic: Deliberate Critique with Large Language Models

DeepCritic: Deliberate Critique with Large Language Models [77.6]
The work focuses on studying and improving the math critique ability of Large Language Models (LLMs). The authors developed a critique model based on Qwen2.5-7B-Instruct.
Reference translation of the paper (metadata) (Thu, 01 May 2025 17:03:17 GMT)
A proposal for a model that performs deep critiques. It is built in two stages: "In Stage 1, we first utilize Qwen2.5-72B-Instruct to generate an initial step-wise critique for each step in the solution, followed by an in-depth critique of the initial critique." and "In Stage 2, we perform RL to the SFT model on either existing human-annotated data or auto-labeled data via Monte Carlo sampling-based correctness estimation, to further stimulate the critique ability of the critic." Critic models are known to be useful for correcting the outputs of other models, but "our 7B critique model is also capable of supervising and correcting the outputs of a 72B generator, demonstrating a potential of weak-to-strong supervision" is particularly interesting.
The repository is GitHub – RUCBM/DeepCritic: Official repository for paper “DeepCritic: Deliberate Critique with Large Language Models”
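The "Monte Carlo sampling-based correctness estimation" mentioned in Stage 2 can be sketched as follows: for each step prefix, sample several continuations and treat the fraction that reach the reference answer as an estimate of whether the prefix is still on track. The `rollout` callable, the threshold rule, and the toy data are hypothetical; the paper's actual labeling procedure is not described here and may differ.

```python
from typing import Callable, List, Optional

# A rollout completes a partial solution and returns a final answer string.
# In practice this would sample from an LLM; here it is a hypothetical callable.
Rollout = Callable[[List[str]], str]


def estimate_step_correctness(
    prefix_steps: List[str],
    rollout: Rollout,
    reference_answer: str,
    num_samples: int = 8,
) -> float:
    """Fraction of sampled continuations that reach the reference answer."""
    hits = sum(
        1 for _ in range(num_samples) if rollout(prefix_steps) == reference_answer
    )
    return hits / num_samples


def first_error_step(
    steps: List[str],
    rollout: Rollout,
    reference_answer: str,
    num_samples: int = 8,
    threshold: float = 0.0,
) -> Optional[int]:
    """1-based index of the first step whose estimate is <= threshold, else None."""
    for i in range(1, len(steps) + 1):
        score = estimate_step_correctness(
            steps[:i], rollout, reference_answer, num_samples
        )
        if score <= threshold:
            return i
    return None


def toy_rollout(prefix: List[str]) -> str:
    """Deterministic toy rollout: any prefix containing an error never recovers."""
    return "5" if any("ERROR" in s for s in prefix) else "4"


solution = ["compute 2 + 2", "ERROR: claim the sum is 5", "answer 5"]
print(first_error_step(solution, toy_rollout, reference_answer="4"))
```

With a real sampler the estimate is noisy, so `num_samples` and `threshold` trade labeling cost against label quality.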

Qwen3, Phi-4 reasoning, MiMo 7B, OLMo2 1B, Mellum 4B

Last week brought a lot of news about open models. Among them, Qwen3 is big news (Qwen3: Think Deeper, Act Faster | Qwen). In addition to the MoE models Qwen3-235B-A22B and Qwen3-30B-A3B, the dense models Qwen3-32B, Qwen3-14B, Qwen3-8B, Qwen3-4B, Qwen3-1.7B, and Qwen3-0.6B have been released (Qwen3 – a Qwen Collection). The license is Apache 2.0. Microsoft's release of the Phi-4 reasoning models (Showcasing Phi-4-Reasoning: A Game-Changer for AI Developers | Microsoft Community Hub, huggingface) is also noteworthy.

There were also many SLM announcements. MiMo from Xiaomi (GitHub – XiaomiMiMo/MiMo: MiMo: Unlocking the Reasoning Potential of Language Model – From Pretraining to Posttraining) and OLMo from Ai2 (OLMo release notes | Ai2) are interesting. Mellum from JetBrains (Mellum Goes Open Source: A Purpose-Built LLM for Developers, Now on Hugging Face | The JetBrains Blog) is a specialized model, as its announcement states: "Mellum doesn’t try to know everything. It’s designed to do one thing really well: code completion. We call it a focal model – built with purposeful depth and not concerned with chasing breadth." At present Mellum's performance is hard to call sufficient, but the direction of specializing an SLM to strengthen it and improve cost-effectiveness is promising. The 671B DeepSeek-Prover-V2 is impressive, but I think combinations such as its clever use of a 7B model will also become important.

Phi-4-reasoning Technical Report [42.5]
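Phi-4-reasoning is a 14-billion-parameter reasoning model that achieves strong performance on complex reasoning tasks. The authors also developed Phi-4-reasoning-plus. Both models outperform larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance level of the full DeepSeek-R1 model.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 05:05:09 GMT)
An LRM in the Phi-4 series.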

Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math [135.1]
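Chain-of-Thought (CoT) significantly improves the formal reasoning capabilities of large language models (LLMs). However, improving reasoning in Small Language Models (SLMs) remains difficult because of their limited model capacity. This work proposes a systematic training recipe for SLMs consisting of four stages: (1) large-scale mid-training on diverse distilled long-CoT data, (2) supervised fine-tuning on high-quality long-CoT data, (3) Rollout DPO leveraging a carefully curated preference dataset, and (4) reinforcement learning (RL) with verifiable reward.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 00:04:35 GMT)
Building a reasoning model on top of an SLM. The authors report confirming the effect: "The resulting Phi-4-Mini-Reasoning model exceeds, on math reasoning tasks, much larger reasoning models, e.g., outperforming DeepSeek-R1-Distill-Qwen-7B by 3.2 points and DeepSeek-R1-Distill-Llama-8B by 7.7 points on Math-500."
An interesting result: reasoning is effective even for small models.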

DeepSeek-Prover-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition [24.5]
The paper introduces DeepSeek-Prover-V2. The model achieves state-of-the-art performance in neural theorem proving, reaching an 88.9% pass ratio on the MiniF2F-test and solving 49 of the 658 problems in PutnamBench. In addition to standard benchmarks, the authors introduce ProverBench, a collection of 325 formalized problems, to enrich the evaluation, including 15 problems selected from recent AIME competitions.
Reference translation of the paper (metadata) (Wed, 30 Apr 2025 16:57:48 GMT)
"We first prompt DeepSeek-V3 to generate a natural-language proof sketch while simultaneously formalizing it into a Lean statement with sorry placeholders for omitted proof details. A 7B prover model then recursively solves the decomposed subgoals. By combining these subgoal proofs, we construct a complete formal proof for the original complex problem. This composed proof is appended to DeepSeek-V3’s original chain-of-thought, creating high-quality cold-start training data for formal mathematical reasoning."
The repository is GitHub – deepseek-ai/DeepSeek-Prover-V2
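For readers unfamiliar with Lean, the "Lean statement with sorry placeholders" in the quote can look roughly like the toy Lean 4 sketch below. The theorem and its name are hypothetical and deliberately trivial, not taken from the paper. Each `sorry` marks a subgoal that a prover model would be asked to fill in, and the final line combines the subgoal proofs into a proof of the original statement.

```lean
-- Hypothetical toy example of subgoal decomposition (not from the paper).
theorem toy_decomposition (a b : Nat) (h : a = b) : a + a = b + b := by
  -- Subgoal 1: replace the second summand; left open for the prover model.
  have h1 : a + a = a + b := by sorry
  -- Subgoal 2: replace the first summand; left open for the prover model.
  have h2 : a + b = b + b := by sorry
  -- Combine the two subgoal proofs into the full proof.
  exact h1.trans h2
```

Lean accepts this file with warnings about `sorry`, so the overall proof structure can be checked before any subgoal is actually solved.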

Reinforcement Learning for Reasoning in Large Language Models with One Training Example

Reinforcement Learning for Reasoning in Large Language Models with One Training Example [129.1]
The paper shows that reinforcement learning with a single training example (1-shot RLVR) is effective at improving the mathematical reasoning ability of large language models (LLMs). Interesting phenomena observed in 1-shot RLVR include cross-domain generalization, an increased frequency of self-reflection, and sustained improvement in test performance even after training accuracy has saturated.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 09:24:30 GMT)
An interesting report: "We find that selecting one specific example as the training dataset can achieve similar downstream performance to that of the 1.2k DeepScaleR subset (DSR-sub) containing that example. Specifically, this improves the Qwen2.5-Math-1.5B model from 36.0% to 73.6% on MATH500, and from 17.6% to 35.7% on average across 6 mathematical reasoning benchmarks (Fig. 1, 2)." The statement "These findings suggest that the reasoning capability of the model is already buried in the base model, and encouraging exploration on a very small amount of data is capable of generating useful RL training signals for igniting LLM’s reasoning capability." is probably right, I think. What is actually inside an LLM, and what is tuning really doing...
The repository is GitHub – ypwang61/One-Shot-RLVR: official repository for “Reinforcement Learning for Reasoning in Large Language Models with One Training Example”

ReasonIR: Training Retrievers for Reasoning Tasks

ReasonIR: Training Retrievers for Reasoning Tasks [139.5]
ReasonIR-8B is the first retriever specifically trained for general reasoning tasks. It achieves new best scores of 29.9 nDCG@10 without a reranker and 36.9 nDCG@10 with a reranker.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 09:49:28 GMT)
A proposal for a bi-encoder retriever built with the help of synthetic data: "We trained REASONIR-8B by fine-tuning LLAMA3.1-8B (Touvron et al., 2023) on a combination of public datasets and the synthetic data generated by REASONIR-SYNTHESIZER." It is also interesting that, even with such a method, a hybrid with BM25 remains effective.
The repositories are GitHub – facebookresearch/ReasonIR: Official repository for paper “ReasonIR Training Retrievers for Reasoning Tasks”., and reasonir/ReasonIR-8B · Hugging Face
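Since the results above are reported in nDCG@10 and the note mentions a BM25 hybrid, a small Python sketch of both may help. `ndcg_at_k` follows the standard definition with linear gain (DCG divided by the DCG of the ideal ordering). `hybrid_scores` is only one common way to combine sparse and dense scores (min-max normalization followed by a weighted sum); it is an assumption for illustration, and the paper's hybrid scheme may differ.

```python
import math
from typing import Dict, List


def dcg(relevances: List[float]) -> float:
    """Discounted cumulative gain with linear gain: sum of rel_i / log2(i + 1)."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances))


def ndcg_at_k(ranked_rels: List[float], all_rels: List[float], k: int = 10) -> float:
    """nDCG@k: DCG of the system ranking over DCG of the ideal ranking.

    ranked_rels: relevance of each returned document, in ranked order.
    all_rels: relevance of every judged document for the query (any order).
    """
    ideal = dcg(sorted(all_rels, reverse=True)[:k])
    if ideal == 0.0:
        return 0.0
    return dcg(ranked_rels[:k]) / ideal


def _min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1]; all-equal scores map to 0.0."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {doc: 0.0 for doc in scores}
    return {doc: (s - lo) / (hi - lo) for doc, s in scores.items()}


def hybrid_scores(
    bm25: Dict[str, float], dense: Dict[str, float], alpha: float = 0.5
) -> Dict[str, float]:
    """Weighted sum of normalized BM25 and dense scores (assumed scheme)."""
    b, d = _min_max(bm25), _min_max(dense)
    docs = set(b) | set(d)
    return {
        doc: alpha * b.get(doc, 0.0) + (1 - alpha) * d.get(doc, 0.0) for doc in docs
    }


print(ndcg_at_k([1, 0, 1], all_rels=[1, 1, 0]))
print(hybrid_scores({"a": 2.0, "b": 0.0}, {"a": 0.0, "b": 1.0}, alpha=0.8))
```

Papers usually report nDCG@10 multiplied by 100, so a value of 0.299 from `ndcg_at_k` corresponds to the "29.9" style of figure quoted above.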

The Leaderboard Illusion

The Leaderboard Illusion [30.2]
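Chatbot Arena has emerged as the go-to leaderboard for ranking the most capable AI systems. The authors identify systematic issues that have resulted in a distorted playing field.
Reference translation of the paper (metadata) (Tue, 29 Apr 2025 15:48:49 GMT)
The paper points out problems with Chatbot Arena and proposes improvements.
"We find that undisclosed private testing practices benefit a handful of providers who are able to test multiple variants before public release and retract scores if desired." and "At an extreme, we identify 27 private LLM variants tested by Meta in the lead-up to the Llama-4 release." are certainly problems.
Designing and operating a leaderboard is very hard, but I hope for improvements wherever they are feasible.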
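Why testing many private variants and publishing only the best one inflates a score can be seen in a small simulation. In the Python sketch below, every variant has the same true skill, and each measured score is the true score plus Gaussian noise. The rating scale, noise level, and trial count are arbitrary assumptions; only the figure of 27 variants comes from the quote above.

```python
import random
from statistics import mean
from typing import Tuple


def simulate_selection_bias(
    num_variants: int = 27,
    true_score: float = 1200.0,
    noise_sd: float = 20.0,
    trials: int = 2000,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean score of one submission, mean score of the best of N).

    All variants share the same true score, so any gap between the two
    numbers comes purely from picking the luckiest measurement.
    """
    rng = random.Random(seed)
    single, best = [], []
    for _ in range(trials):
        measured = [rng.gauss(true_score, noise_sd) for _ in range(num_variants)]
        single.append(measured[0])   # a provider that submits one model
        best.append(max(measured))   # a provider that keeps only the top variant
    return mean(single), mean(best)


single_mean, best_mean = simulate_selection_bias()
print(f"single submission: {single_mean:.1f}, best of 27: {best_mean:.1f}")
```

The gap grows with both the number of variants and the measurement noise, which is why undisclosed private testing matters most when ratings are based on few comparisons.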

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs [34.4]
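Memory is the process of encoding, storing, and retrieving information. In the era of large language models (LLMs), memory refers to the ability of an AI system to retain, recall, and use information from past interactions in order to improve future responses and interactions.
Reference translation of the paper (metadata) (Tue, 22 Apr 2025 15:05:04 GMT)
A survey on LLM memory, a topic that is difficult to handle.
Various methods have been proposed, but many challenges remain to be solved. The Open Problems and Future Directions section is very informative.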
