August 25, 2025 – Introducing the Latest arXiv Papers

Command A Reasoning, DeepSeek V3.1, Gemma 3 270M, Nemotron Nano 2, Dream 7B

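There really is a lot going on around LLMs/LRMs. Last week's big news was the release of Cohere's Command A Reasoning (Cohere’s Command A Reasoning Model | Cohere; the model is at Cohere’s Command A Reasoning Model | Cohere, under CC-BY-NC) and of DeepSeek V3.1 (DeepSeek-V3.1 Release | DeepSeek API Docs; the model is deepseek-ai/DeepSeek-V3.1 · Hugging Face). Releases of models at or near the frontier matter a great deal. Intern-S1 has also published its technical report.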

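On the small-model side, Gemma 3 270M (Introducing Gemma 3 270M: The compact model for hyper-efficient AI – Google Developers Blog; the model is google/gemma-3-270m · Hugging Face) is interesting because it is so small. Its raw performance is questionable, but it will likely be useful in some settings, for example after post-training for a specialized task. NVIDIA's Nemotron Nano 2 is also worth watching (despite the name "Nano", it has 9B parameters).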

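Huawei released the paper for the diffusion-based Dream 7B. It outperforms LLaDA and appears to hold its own against autoregressive models of the same size, which is strong performance.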

Intern-S1: A Scientific Multimodal Foundation Model [185.4]
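Intern-S1 is a specialized generalist with general understanding and reasoning capabilities. Intern-S1 is trained with offline and online reinforcement learning (RL) in InternBootCamp. Among open-source models, Intern-S1 shows competitive performance on general reasoning tasks.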
Reference translation (metadata) (Thu, 21 Aug 2025 17:58:00 GMT)
This is the technical report for a model covered in Qwen3-Coder, Intern-S1, Step-Audio2, TeleChat2 – Introducing the Latest arXiv Papers.

NVIDIA Nemotron Nano 2: An Accurate and Efficient Hybrid Mamba-Transformer Reasoning Model [176.4]
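Nemotron-Nano-9B-v2 is a hybrid Mamba-Transformer language model designed to increase throughput on reasoning workloads. Nemotron-Nano-9B-v2 is based on the Nemotron-H architecture and replaces most of the self-attention layers of the standard Transformer architecture with Mamba-2 layers.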
Reference translation (metadata) (Thu, 21 Aug 2025 04:18:04 GMT)
nvidia/NVIDIA-Nemotron-Nano-9B-v2 · Hugging Face
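The minimal sketch below illustrates what "replacing most self-attention layers with Mamba-2 layers" means for the layer stack. It builds a hypothetical hybrid schedule in which Mamba-2 blocks dominate and a few attention blocks are spread evenly. The function name `hybrid_layer_pattern` and the values `num_layers=56` and `num_attention=4` are illustrative assumptions. They are not the published Nemotron-Nano-9B-v2 configuration.

```python
from collections import Counter


def hybrid_layer_pattern(num_layers: int, num_attention: int) -> list[str]:
    """Return a hypothetical layer schedule for a hybrid Mamba-Transformer.

    Most positions are Mamba-2 blocks; a small number of self-attention
    blocks are spread evenly through the stack. This layout is an
    illustrative assumption, not the actual Nemotron-H design.
    """
    if not 0 <= num_attention <= num_layers:
        raise ValueError("num_attention must be between 0 and num_layers")
    layers = ["mamba2"] * num_layers
    if num_attention == 0:
        return layers
    # Split the stack into equal segments and put one attention block
    # in the middle of each segment.
    segment = num_layers / num_attention
    for i in range(num_attention):
        layers[int(segment * i + segment / 2)] = "attention"
    return layers


if __name__ == "__main__":
    # Hypothetical sizes chosen only for illustration.
    pattern = hybrid_layer_pattern(num_layers=56, num_attention=4)
    print(Counter(pattern))  # Counter({'mamba2': 52, 'attention': 4})
    print([i for i, kind in enumerate(pattern) if kind == "attention"])  # [7, 21, 35, 49]
```

Mamba-2 layers carry a fixed-size recurrent state instead of a key/value cache that grows with sequence length. This is the usual motivation for such hybrids when generating long reasoning traces.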

Dream 7B: Diffusion Large Language Models [85.3]
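We introduce Dream 7B, the most powerful open diffusion large language model to date. Our model consistently outperforms existing diffusion language models on general, mathematical, and coding tasks.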
Reference translation (metadata) (Thu, 21 Aug 2025 12:09:58 GMT)
The paper reports: "Dream 7B achieves competitive performance with Qwen 2.5 on standard benchmarks (general language understanding, mathematical reasoning, and code generation) while exhibiting superior planning abilities and novel inference flexibility features that naturally emerge from the diffusion modeling paradigm."
The repository is GitHub – DreamLM/Dream: Dream 7B, a large diffusion language model. The models are in Dream 7B – a Dream-org Collection.

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction

FutureX: An Advanced Live Benchmark for LLM Agents in Future Prediction [84.4]
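FutureX is the largest and most diverse live benchmark for future prediction. It supports real-time daily updates and eliminates data contamination through an automated pipeline for collecting questions and answers. It evaluates 25 LLM/agent models with capabilities including reasoning, search, and external tool integration.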
Reference translation (metadata) (Sat, 16 Aug 2025 08:54:08 GMT)
This is a live benchmark for future prediction, and it covers a broad range of domains: "we introduce FutureX, a dynamic and live evaluation benchmark specifically designed for LLM agents performing future prediction tasks. FutureX is built upon a semi-automated pipeline that continuously collects future-oriented questions from 195 diverse websites, curated from a pool of 2,008 sites covering areas such as politics, economics, technology, sports, healthcare, and more."
The overall result is that "LLM agents still lag behind humans." Interestingly, though, some agents outperform humans at Level 2. (The level classification also feels slightly off to me...)
- The Basic tier (Level 1) contains single-choice events with options fewer than 4.
- The Wide Search tier (Level 2) comprises multi-choice events with several correct answers.
- The Deep Search tier (Level 3) contains open-ended events whose underlying facts are relatively stable (with low volatility).
- The Super Agent tier (Level 4) covers high-volatility, open-ended events.
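The four tier definitions amount to a small decision rule over the type of event. The minimal Python sketch below expresses that rule. It assumes a hypothetical `Event` record with an answer format, an option count, a number of correct answers, and a boolean volatility flag. These field names and the `assign_tier` function are illustrative only and do not reflect FutureX's actual pipeline or schema.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass
class Event:
    """Hypothetical description of a prediction question (not FutureX's schema)."""
    answer_format: Literal["single_choice", "multi_choice", "open_ended"]
    num_options: int = 0           # only meaningful for choice questions
    num_correct: int = 1           # how many options are correct
    high_volatility: bool = False  # whether the underlying facts change quickly


def assign_tier(event: Event) -> Optional[int]:
    """Map an event to a FutureX-style tier (1-4) using the definitions listed above.

    Returns None for events the four tier descriptions do not cover,
    e.g. a single-choice question with 4 or more options.
    """
    if event.answer_format == "single_choice":
        # Level 1 (Basic): single-choice with fewer than 4 options.
        return 1 if event.num_options < 4 else None
    if event.answer_format == "multi_choice":
        # Level 2 (Wide Search): multi-choice with several correct answers.
        return 2 if event.num_correct > 1 else None
    if event.answer_format == "open_ended":
        # Level 3 (Deep Search) vs Level 4 (Super Agent): split on volatility.
        return 4 if event.high_volatility else 3
    return None


if __name__ == "__main__":
    print(assign_tier(Event("single_choice", num_options=3)))                 # 1
    print(assign_tier(Event("multi_choice", num_options=6, num_correct=2)))   # 2
    print(assign_tier(Event("open_ended")))                                   # 3
    print(assign_tier(Event("open_ended", high_volatility=True)))             # 4
```

Written out this way, Levels 1 and 2 are distinguished by answer format, while Levels 3 and 4 share the open-ended format and differ only in volatility.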

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance

Never Compromise to Vulnerabilities: A Comprehensive Survey on AI Governance [211.1]
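This work proposes a comprehensive framework that integrates technical and societal dimensions. It is built around three pillars: intrinsic security, derivative security, and social ethics. We identify three challenges: (1) a generalization gap, where defenses fail against evolving threats; (2) inadequate evaluation protocols that ignore real-world risks; and (3) fragmented regulations that lead to inconsistent oversight. Our framework gives researchers, engineers, and policymakers practical guidance for developing AI systems that are not only robust and secure but also ethically aligned and worthy of public trust.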
Reference translation (metadata) (Tue, 12 Aug 2025 09:42:56 GMT)
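This paper gives an overview of AI governance: "This paper offers a comprehensive overview of AI governance, addressing challenges across intrinsic security, derivative security, and social ethics." It also has a repository, which is nice, although the paper list in the repository may still be a work in progress.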
The repository is GitHub – ZTianle/Awesome-AI-SG: Awesome papers and resources related to the AI Safety and Governance.
