staka – Page 80 – Introducing the Latest arXiv Papers

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

GR00T N1: An Open Foundation Model for Generalist Humanoid Robots [134.0]
General-purpose robots need a versatile body and an intelligent mind. Recent advances in humanoid robots show great promise as a hardware platform for building generalist autonomy. We introduce GR00T N1, an open foundation model for humanoid robots.
Paper reference translation (metadata) (Tue, 18 Mar 2025 21:06:21 GMT)
A proposal from NVIDIA of a Vision-Language-Action model targeting humanoid robots ("GR00T N1, an open foundation model for generalist humanoid robots"). The architecture is described as "We design a compositional model that integrates Vision-Language Model (VLM)-based reasoning module (System 2) and Diffusion Transformer (DiT)-based action module (System 1) in a unified learning framework;".
Repositories: GitHub – NVIDIA/Isaac-GR00T: NVIDIA Isaac GR00T N1 is the world’s first open foundation model for generalized humanoid robot reasoning and skills., and nvidia/PhysicalAI-Robotics-GR00T-X-Embodiment-Sim · Datasets at Hugging Face

Mistral Small 3.1, Hunyuan-T1

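This is starting to feel like a weekly LLM/LRM digest, and last week was again full of news. Mistral Small 3.1 | Mistral AI is a publicly released model that claims performance competitive with Gemma 3 and similar models. NVIDIA's llama-3.3-nemotron-super-49b-v1 Model by NVIDIA | NVIDIA NIM produced interesting results on efficiency.

As announced in advance, Tencent released Hunyuan-T1, a Mamba-hybrid LRM (腾讯混元, Hunyuan T1 – a Hugging Face Space by tencent, llm.hunyuan.T1). Its performance looks fully competitive with DeepSeek R1 and o1.

Anthropic announced web search integration (Claude can now search the web \ Anthropic), and OpenAI announced new audio models (Introducing next-generation audio models in the API | OpenAI, OpenAI.fm). From a business standpoint, it looks increasingly important not only to offer LLMs and LRMs but also to fill in the surrounding areas.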

Empowering LLMs in Decision Games through Algorithmic Data Synthesis

Empowering LLMs in Decision Games through Algorithmic Data Synthesis [29.1]
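Decision-making games serve as an ideal sandbox for evaluating and enhancing the reasoning abilities of large language models. We design data synthesis strategies and curate extensive offline datasets from two classic games, Doudizhu and Go. We develop a suite of techniques for effectively incorporating this data into LLM training, resulting in two novel agents, Mastermind-Dou and Mastermind-Go.
Paper reference translation (metadata) (Tue, 18 Mar 2025 07:30:29 GMT)
LRM-style training usually targets math and code generation, but this paper targets games: "Through a suite of our designed techniques in data collection and training, we have developed MasterMind agents, demonstrating commendable performance in both Doudizhu and Go." The claim that "Empirical experiments also serve to substantiate the potential of this approach in improving general reasoning capabilities of LLMs." is very interesting. Perhaps some tasks are, in human terms, "good for the brain". (That said, the paper also points out that performance drops on some tasks.)
The dataset is publicly available: OpenDILabCommunity/MasterMind · Datasets at Hugging Face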

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models

Towards Reasoning Era: A Survey of Long Chain-of-Thought for Reasoning Large Language Models [11.3]
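The long chain-of-thought (Long CoT) property enhances reasoning ability and enables solving complex problems. First, we distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms. Next, we discuss the characteristics of Long CoT, including its emergence, overthinking, and test-time scaling.
Paper reference translation (metadata) (Wed, 12 Mar 2025 17:35:03 GMT)
A survey of long chain-of-thought, a key ingredient of LRMs. As stated in "We first distinguish Long CoT from Short CoT and introduce a novel taxonomy to categorize current reasoning paradigms.", it separates (ordinary) Short CoT from Long CoT.
Repository: Towards Reasoning Era: A Survey of Long Chain-of-Thought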

Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models

Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models [39.7]
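Retrieval-Augmented Generation (RAG) mitigates hallucinations in large language models (LLMs) by integrating external knowledge. Conflicts between parametric knowledge and retrieved context pose challenges for RAG. We propose CK-PLUG, a plug-and-play method for controlling RAG's reliance on parametric and contextual knowledge.
Paper reference translation (metadata) (Thu, 20 Mar 2025 06:26:28 GMT)
A proposal of CK-PLUG (Controllable Knowledge Plug-in), a method for balancing the LLM's internal knowledge (parametric knowledge) against knowledge supplied by components such as the RAG retriever (retrieved context).
Repository: GitHub – byronBBL/CK-PLUG: Official repository of paper “Parameters vs. Context: Fine-Grained Control of Knowledge Reliance in Language Models”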
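
To make the idea of "controlling reliance" concrete, here is a minimal sketch of one generic way to trade off two knowledge sources at decoding time: a log-linear interpolation of next-token logits computed with and without the retrieved context. This is an illustration only and is not the CK-PLUG algorithm, whose details are not covered in this summary. The function name `blend_next_token` and the weight `alpha` are hypothetical.

```python
import math


def softmax(logits):
    """Convert logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def blend_next_token(logits_param, logits_ctx, alpha):
    """Blend two next-token logit vectors over the same vocabulary.

    logits_param: logits from the model WITHOUT retrieved context
                  (stands in for parametric knowledge).
    logits_ctx:   logits from the model WITH retrieved context.
    alpha:        hypothetical knob in [0, 1]; 1.0 uses only the
                  parametric logits, 0.0 uses only the contextual ones.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if len(logits_param) != len(logits_ctx):
        raise ValueError("logit vectors must have the same length")
    mixed = [alpha * p + (1.0 - alpha) * c
             for p, c in zip(logits_param, logits_ctx)]
    return softmax(mixed)


# Toy vocabulary of three tokens: the two sources disagree on the top token.
param = [2.0, 0.0, 0.0]    # parametric knowledge prefers token 0
context = [0.0, 2.0, 0.0]  # retrieved context prefers token 1
for a in (0.0, 0.5, 1.0):
    probs = blend_next_token(param, context, a)
    print(a, [round(p, 3) for p in probs])
```

Sweeping `alpha` from 0 to 1 moves the preferred token from the one supported by the context to the one supported by the model's own parameters.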

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond

Light-R1: Curriculum SFT, DPO and RL for Long COT from Scratch and Beyond [14.4]
We first focus on training long-COT models from scratch, starting from models that lack long-COT capability. Using a curriculum training recipe consisting of two-stage SFT and semi-on-policy DPO, we train our model Light-R1-32B from Qwen2.5-32B-Instruct. With AIME24 and AIME25 scores of 74.0 and 60.2 respectively, Light-R1-14B-DS surpassed 32B models and DeepSeek-R1-Distill-Llama-70B.
Paper reference translation (metadata) (Thu, 13 Mar 2025 15:29:22 GMT)
A model built with two-stage SFT + DPO (+ model merging). True to "High-Quality Data is All You Need", the dataset-side pipeline is also elaborate. Other studies have reported similar findings, but "Despite being trained exclusively on math data, Light-R1-32B shows strong generalization across other domains." is interesting.
Repository: GitHub – Qihoo360/Light-R1
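
For readers unfamiliar with the DPO step in this recipe, the following is a minimal sketch of the standard DPO objective for a single preference pair. It is not Light-R1's training code, and it does not model the "semi-on-policy" data collection. The inputs are assumed to be summed log-probabilities of each response under the policy and a frozen reference model, and `beta=0.1` is an assumed value.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each *_logp is the summed token log-probability of a response under
    the policy or the frozen reference model. beta scales how strongly
    the policy is pushed away from the reference (0.1 is an assumption).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


# If the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-1.0, -2.0, -1.0, -2.0))
# Raising the chosen response relative to the reference lowers the loss.
print(dpo_loss(-0.5, -3.0, -1.0, -2.0))
```

The loss only depends on how much the policy has shifted probability toward the chosen response and away from the rejected one, relative to the reference model.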

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions

Qilin: A Multimodal Information Retrieval Dataset with APP-level User Sessions [39.2]
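The challenge of improving user experience in complex systems involving search and recommendation (S&R) has drawn significant attention from both academia and industry. This paper presents a new multimodal information retrieval dataset, Qilin. The dataset is collected from Xiaohongshu, which has 300 million monthly active users and an average search penetration rate of over 70%.
Paper reference translation (metadata) (Sat, 01 Mar 2025 14:15:00 GMT)
A dataset targeting multimodal search and recommendation.
Repository: GitHub – RED-Search/Qilin: Resources and code for the Qilin dataset.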

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k

Open-Sora 2.0: Training a Commercial-Level Video Generation Model in $200k [39.5]
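We introduce Open-Sora 2.0, a commercial-level video generation model. We show that the cost of training a top-performing video generation model is highly controllable. By making Open-Sora 2.0 fully open source, we aim to democratize access to advanced video generation technology.
Paper reference translation (metadata) (Wed, 12 Mar 2025 05:00:07 GMT)
As the name suggests, a proposal of an open video generation model.
Repository: GitHub – hpcaitech/Open-Sora: Open-Sora: Democratizing Efficient Video Production for All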

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models

PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models [10.3]
We propose PlanGen, a unified layout planning and image generation model that can pre-plan spatial layout conditions before generating images. PlanGen integrates layout conditions into the model as context, without requiring special encoding of local captions and bounding box coordinates. Moreover, thanks to its well-designed modeling, PlanGen extends seamlessly to layout-guided image manipulation.
Paper reference translation (metadata) (Thu, 13 Mar 2025 07:37:09 GMT)
A proposal of a model that can plan the layout before generating an image. It can take the layout as context: "PlanGen can complete layout planning and layout-to-image generation in a unified model. Just like thinking about what object each area should be before generating an image, such an explicit planning process allows the model to enjoy more powerful image generation capabilities."
Repository: PlanGen: Towards Unified Layout Planning and Image Generation in Auto-Regressive Vision Language Models
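
As a rough illustration of what "layout as context" can mean, the sketch below serializes local captions and bounding boxes into plain text that could be placed in front of an image-generation prompt. The text format, the `Region` class, and `layout_to_context` are all hypothetical and are not taken from the PlanGen paper or repository.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Region:
    """One planned object: a local caption plus a normalized box."""
    caption: str
    box: Tuple[float, float, float, float]  # (x0, y0, x1, y1) in [0, 1]


def layout_to_context(global_caption: str,
                      regions: List[Region],
                      grid: int = 100) -> str:
    """Serialize a layout as plain text (hypothetical format).

    Coordinates are quantized to integers in [0, grid] so that they
    can be written as ordinary text rather than a special encoding.
    """
    lines = [f"Caption: {global_caption}"]
    for region in regions:
        x0, y0, x1, y1 = region.box
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box for {region.caption!r}")
        q = [round(v * grid) for v in region.box]
        lines.append(f"{region.caption} [{q[0]}, {q[1]}, {q[2]}, {q[3]}]")
    return "\n".join(lines)


layout = [
    Region("sofa", (0.0, 0.5, 1.0, 1.0)),
    Region("cat", (0.25, 0.5, 0.75, 1.0)),
]
print(layout_to_context("a cat on a sofa", layout))
```

Because the layout is ordinary text in this sketch, the same string could be produced by a planning step or supplied directly by a user.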

Biomedical Foundation Model: A Survey

Biomedical Foundation Model: A Survey [84.3]
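Foundation models are large-scale pre-trained models that learn from extensive unlabeled datasets. These models can be adapted to various applications such as question answering and visual understanding. This study explores the potential of foundation models in the biomedical field.
Paper reference translation (metadata) (Mon, 03 Mar 2025 22:42:00 GMT)
A survey of foundation models in biology and medicine. The main targets are "computational biology, drug development, clinical informatics, medical imaging, and public health".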
