world model – Page 2 – Introducing the Latest arXiv Papers

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text

The World is Your Canvas: Painting Promptable Events with Reference Images, Trajectories, and Text [101.7]
WorldCanvas is a framework for promptable world events that enables rich, user-directed simulation. By supporting expressive world-event generation, WorldCanvas advances world models from passive predictors to interactive, user-shaped simulators.
Reference translation of the paper (metadata) (Thu, 18 Dec 2025 18:59:59 GMT)
Working from the premise that "World models [3, 12, 15, 22, 38, 46] are unlocking their true potential, evolving from passive simulators into interactive canvases for creation. A landmark step in this evolution is the introduction of “promptable world events,” a concept pioneered by models like Genie 3 [3], which transforms the world model into an interactive canvas where text prompts can trigger significant environmental changes.", the authors build a model that delivers the following: "By enabling users to precisely specify what, when, where, and who through intuitive motion trajectories, natural language and ref images, our approach supports semantic actions, complex interactions, object entry/exit and reference-guided appearance."
The project site is The World is Your Canvas

Large Video Planner Enables Generalizable Robot Control

Large Video Planner Enables Generalizable Robot Control [117.5]
General-purpose robots need decision-making models that generalize across diverse tasks and environments. Recent work builds robot foundation models by extending multimodal large language models with action outputs to create vision-language-action (VLA) systems. This paper explores an alternative paradigm that uses large-scale video pre-training as the primary modality for building robot foundation models.
Reference translation of the paper (metadata) (Wed, 17 Dec 2025 18:35:54 GMT)
A proposal for a video-centric action-planning model for robots: "We present Large Video Planner (LVP), a 14-billion parameter video foundation model for embodiment planning. LVP generates videos as motion plans conditioned on one or a few scene frames and a text description of the task. We demonstrate that these generated motion plans can be successfully retargeted to dexterous robotic hands using open-source reconstruction and retargeting tools. Evaluations on third-party proposed tasks show evidence of task-level generalization, a capability limited in existing VLA models."
Given how related methods have been evolving, this does look like a potentially promising approach.

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World [100.7]
Agents synthesize realistic 4D driving environments that look convincing but often fail physically or behaviorally. We introduce WorldLens, a full-spectrum benchmark that evaluates how well a model builds, understands, and behaves within its generated world. We further construct WorldLens-26K, a large-scale dataset of human-annotated videos with numerical scores and textual rationales, and develop WorldLens-Agent.
Reference translation of the paper (metadata) (Thu, 11 Dec 2025 18:59:58 GMT)
A benchmark described as follows: "We introduce WorldLens, a full-spectrum benchmark evaluating how well a model builds, understands, and behaves within its generated world. It spans five aspects – Generation, Reconstruction, Action-Following, Downstream Task, and Human Preference – jointly covering visual realism, geometric consistency, physical plausibility, and functional reliability."
The repository is GitHub – worldbench/WorldLens: 🌐 WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World; the project site is WorldLens: Full-Spectrum Evaluations of Driving World Models in Real World

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark

Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark [48.0]
Video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning. Existing benchmarks focus on fidelity or alignment and do not evaluate CoF reasoning. We introduce Gen-ViRe, a framework grounded in cognitive science and real-world AI applications.
Reference translation of the paper (metadata) (Mon, 17 Nov 2025 19:11:39 GMT)
A proposed benchmark that evaluates how well video generation models grasp causal relationships (that is, their potential as world models). "Gen-ViRe evaluates six core cognitive dimensions: (1) Perceptual, (2) Analogical, (3) Abstract, (4) Planning, (5) Spatial & Temporal, and (6) Algorithmic & Logical, with each dimension comprising four different sub-categories."
"Sora-2 achieves the highest overall score (0.560), establishing the top tier with particularly strong performance in the most cognitively demanding domains: “Abstract Reasoning” (0.604), “Algorithmic & Logical” (0.472), and “Perceptual” (0.496). The second tier comprises three highly competitive models—Hailuo-2.3 (0.493), Wan-2.5 (0.490), and Veo-3.1 (0.486)—each exhibiting distinct specialized strengths. Hailuo-2.3 achieves the highest score in “Planning” (0.778), showcasing exceptional sequential decision-making capabilities, while Wan-2.5 leads in “Analogy” (0.500), excelling at analogical reasoning." It is interesting that the models' strengths differ so much from one another.
The repository is https://github.com/L-CodingSpace/GVR

A Step Toward World Models: A Survey on Robotic Manipulation

A Step Toward World Models: A Survey on Robotic Manipulation [58.7]
Through a review of methods in robotic manipulation, this paper examines approaches that exhibit the core capabilities of world models. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a real world model should possess.
Reference translation of the paper (metadata) (Fri, 31 Oct 2025 00:57:24 GMT)
The authors put it this way: "In this survey, rather than directly imposing a fixed definition and limiting our scope to methods explicitly labeled as world models, we examine approaches that exhibit the core capabilities of world models through a review of methods in robotic manipulation. We analyze their roles across perception, prediction, and control, identify key challenges and solutions, and distill the core components, capabilities, and functions that a real world model should possess."

GPT-5.1, ERNIE 5, Marble, SIMA2

Last week again brought a run of news, including the release of GPT-5.1 (GPT-5.1: A smarter, more conversational ChatGPT | OpenAI) and the release of ERNIE 5 (Baidu Inc. on X: "Here comes ERNIE 5.0 — our latest natively omni-modal foundational model. It excels in omni-modal understanding, creative writing, instruction following, and more. We will continue investing in and developing more cutting-edge models to push the boundaries of intelligence. https://t.co/S3L1Tlre2n" / X). Evaluation is still to come, but it is impressive how quickly these models are being rolled out at scale.

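Attempts to build world models on top of generative models, such as video generation and 3D generation, are in vogue, and Marble: A Multimodal World Model | World Labs also deserves close attention. SIMA 2: A Gemini-Powered AI Agent for 3D Virtual Worlds – Google DeepMind, also announced last week, mentions Genie 3 (Genie 3: A new frontier for world models – Google DeepMind), which suggests such models are also useful as a place for AI agents to learn. Their usefulness as an AI's inner, imagined world has been pointed out as well, making this a hot area.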

World Simulation with Video Foundation Models for Physical AI

World Simulation with Video Foundation Models for Physical AI [181.8]
We release [Cosmos-Predict2.5] and [Cosmos-Transfer2.5] as general-purpose tools for scaling embodied intelligence. We release the source code, pretrained checkpoints, and curated benchmarks under the NVIDIA Open Model License.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 22:44:13 GMT)
A world simulator usable for synthetic data for VLA models, autonomous driving, and similar applications; an update to Cosmos World Foundation Model Platform for Physical AI – Introducing the Latest arXiv Papers. "[Cosmos-Predict2.5] and [Cosmos-Transfer2.5], the latest Cosmos video world foundation models for Physical AI"
The project site is Deep Imagination Research | NVIDIA; the repository is GitHub – nvidia-cosmos/cosmos-predict2.5: Cosmos-Predict2.5, the latest version of the Cosmos World Foundation Models (WFMs) family, specialized for simulating and predicting the future state of the world in the form of video.

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark [124.0]
We conduct an empirical study of whether video models are ready to serve as zero-shot reasoners. We focus on the popular Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 17:59:55 GMT)
There is also the claim made in "Video models are zero-shot learners and reasoners – Introducing the Latest arXiv Papers", but this paper comes from a different team. "Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but exhibit encouraging signs as complementary visual engines alongside dedicated reasoning models." So the results do suggest potential.
The project site is Are Video Models Ready as Zero-Shot Reasoners?

Co-Evolving Latent Action World Models, SPICE : Self-Play In Corpus Environments Improves Reasoning, Critique-RL, Parrot

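Last week saw several papers that aim to improve performance by co-evolving two different components. GANs are the best-known framework of this kind, but the approach still shows up regularly in the LLM-based era, which is very interesting.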

Co-Evolving Latent Action World Models [57.5]
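Adapting pre-trained video models into controllable world models via latent actions is a promising step toward generalist world models. This paper proposes CoLA-World, the first to realize this synergistic paradigm. The world model acts as a knowledgeable tutor, providing gradients that shape a high-quality LAM.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 12:28:40 GMT)
"We propose CoLA-World, the first framework that successfully enables joint training of a latent action model with a pre-trained video-generation-based world model." In other words, the latent action model (LAM) and the world model are trained jointly.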

SPICE: Self-Play In Corpus Environments Improves Reasoning [58.8]
SPICE is a reinforcement learning framework in which a single model acts in two roles. The Challenger mines documents from a large corpus to generate diverse reasoning tasks. Our analysis reveals how document grounding is a key ingredient in SPICE for continuously generating increasingly challenging goals.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 17:46:16 GMT)
"SPICE is a self-play framework where a single LLM, πθ, acts in two roles: a Challenger (role = C), which poses difficult questions, and a Reasoner (role = R), which tries to correctly answer such questions. The Challenger uses a raw document (which does not contain existing questions or labels) from a corpus to generate a (q, a∗) pair." A reinforcement learning framework built around a Challenger and a Reasoner.
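The two-role loop quoted above can be pictured with a toy sketch. Everything here is a hypothetical stand-in rather than the paper's implementation: `challenger` masks a word instead of prompting an LLM, `reasoner` is a lookup table instead of a policy, and writing to `memory` stands in for the reinforcement learning update. The only parts taken from the quote are the structure (the Challenger sees a raw document and emits a (q, a*) pair; the Reasoner answers without the document) and the idea of scoring the answer.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    question: str
    answer: str  # a* in the quoted notation


def challenger(document: str, rng: random.Random) -> Task:
    """Hypothetical Challenger: mask one word of the document and ask for it."""
    words = document.split()
    idx = rng.randrange(len(words))
    masked = list(words)
    masked[idx] = "____"
    return Task(question=" ".join(masked), answer=words[idx])


def reasoner(question: str, memory: dict) -> str:
    """Hypothetical Reasoner: sees only the question, never the document."""
    return memory.get(question, "unknown")


def self_play_round(corpus: list, memory: dict, rng: random.Random) -> float:
    """One round: pose a task from a document, answer it, score it."""
    task = challenger(rng.choice(corpus), rng)
    prediction = reasoner(task.question, memory)
    reward = 1.0 if prediction == task.answer else 0.0
    memory[task.question] = task.answer  # stand-in for the RL update
    return reward


rng = random.Random(0)
corpus = ["Paris"]  # a one-word document keeps the toy deterministic
memory = {}
first = self_play_round(corpus, memory, rng)
second = self_play_round(corpus, memory, rng)
print(first, second)
```

With a single one-word document the first round fails and the second succeeds, which is the smallest possible picture of the Reasoner improving on tasks the Challenger grounds in the corpus.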

Critique-RL: Training Language Models for Critiquing through Two-Stage Reinforcement Learning [89.6]
We propose Critique-RL, an online RL approach for developing critiquing language models without stronger supervision. The method is built on a paradigm in which the actor generates a response, the critic provides feedback, and the actor refines the response accordingly. Experiments across various tasks and models show that Critique-RL delivers substantial performance improvements.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 11:37:01 GMT)
"In stage I, it reinforces the discriminability of the critic with direct rule-based reward signals; in stage II, it introduces indirect rewards based on actor refinement to improve the critic’s helpfulness, while maintaining its discriminability via appropriate regularization. Extensive experiments across various tasks and models show that Critique-RL delivers substantial performance improvements." A two-stage scheme for strengthening the critic model (though, unlike the other papers here, the actor side is not updated).
The repository is GitHub – WooooDyy/Critique-RL
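As a rough illustration of the two reward signals described in the quote, here is a minimal sketch. The function names, the 0/1 reward values, the additive combination, and `reg_weight` are all assumptions made for illustration; the quote only says that stage I uses a direct rule-based reward for discriminability and that stage II adds an indirect reward from the actor's refinement while regularizing to keep discriminability.

```python
def stage1_reward(critic_says_correct: bool, response_is_correct: bool) -> float:
    """Direct rule-based signal: did the critic judge the response correctly?"""
    return 1.0 if critic_says_correct == response_is_correct else 0.0


def stage2_reward(
    refined_is_correct: bool,
    critic_says_correct: bool,
    response_is_correct: bool,
    reg_weight: float = 0.5,  # hypothetical weight, not taken from the paper
) -> float:
    """Indirect signal from the actor's refinement, plus an assumed additive
    discriminability term standing in for the paper's regularization."""
    helpfulness = 1.0 if refined_is_correct else 0.0
    discriminability = stage1_reward(critic_says_correct, response_is_correct)
    return helpfulness + reg_weight * discriminability


# The critic flags a wrong answer as wrong, and the actor then fixes it.
print(stage2_reward(True, False, False))
# The critic approves a wrong answer, and the refinement stays wrong.
print(stage2_reward(False, True, False))
```

The point of the second term is that a critic cannot earn the full stage II reward merely by producing feedback that happens to lead to a correct refinement; it is still scored on whether its verdict was right.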

Parrot: A Training Pipeline Enhances Both Program CoT and Natural Language CoT for Reasoning [69.0]
Natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT) have emerged as two primary paradigms for large language models (LLMs) to solve mathematical reasoning problems. We propose Parrot, a novel training pipeline for mathematical problems.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 09:23:17 GMT)
Strengthens both natural language chain-of-thought (N-CoT) and program chain-of-thought (P-CoT), built around three subtasks: "The pipeline comprises three target-designed subtasks: Information Retrieval trains the model to concentrate on key information within problem. P-CoT Reasoning utilizes the information to generate variable well-defined code solutions. Paradigm Conversion enhances N-CoT with concise P-CoT and its intermediate outputs."

MiniMax M2, Kimi-Linear, Ling-V2, Ouro, Emu3.5, gpt-oss-safeguard

Last week had a lot of news about publicly released models, of which MiniMax-M2 and Kimi-Linear deserve the most attention; the latter is also notably efficient. It is easy to confuse with last week's Ring, but Ling-V2 is also a strong model (the report says: "This report focuses on three reflex-grade non-thinking (instruct) models in the Ling 2.0 family—Ling-mini-2.0, Ling-flash-2.0, and Ling-1T. These models emphasize general reasoning and instruction-following capability, while the Ring series (Ling-Team, 2025), built upon the same Ling 2.0 base, extends toward deep thinking models."). The small Ouro-2.6B and Ouro-2.6B-Thinking models were also interesting.

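Separately from the above, it is good to see strong models being released, such as the multimodal Emu3.5 and gpt-oss-safeguard for classification tasks (safety classification tasks). (The intended use of the last one does seem quite different from the others, though.)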

Kimi Linear: An Expressive, Efficient Attention Architecture [75.9]
Kimi Linear is a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons. At its core, Kimi Delta Attention (KDA) is an expressive linear attention module that extends Gated DeltaNet. We show that Kimi Linear can serve as a drop-in replacement for full attention with better performance and efficiency.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 16:59:43 GMT)
The module described here is used in a hybrid configuration: "At its core lies Kimi Delta Attention (KDA), a hardware-efficient linear attention module that extends Gated DeltaNet [111] with a finer-grained gating mechanism. While GDN, similar to Mamba2 [16], employs a coarse head-wise forget gate, KDA introduces a channel-wise variant in which each feature dimension maintains an independent forgetting rate, akin to Gated Linear Attention (GLA) [114]. This fine-grained design enables more precise regulation of the finite-state RNN memory, unlocking the potential of RNN-style models within hybrid architectures."
GitHub – MoonshotAI/Kimi-Linear
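To make the head-wise versus channel-wise distinction concrete, here is a minimal NumPy sketch of a decayed linear-attention recurrence for a single head. It is deliberately simplified and is not Kimi Linear's code: it omits the delta-rule correction that Gated DeltaNet and KDA apply, it is a plain Python loop rather than a hardware-efficient kernel, and the function name, shapes, and gate values are assumptions chosen for illustration. What it does show is the one difference the quote highlights: a scalar forgetting rate shared by the whole head versus one rate per key channel.

```python
import numpy as np


def gated_linear_attention(q, k, v, decay):
    """Decayed linear-attention recurrence over one head.

    q, k: arrays of shape (T, d_k); v: array of shape (T, d_v).
    decay: shape (T,) for a head-wise gate (one forgetting rate per step)
           or (T, d_k) for a channel-wise gate (one rate per key channel).
    """
    T, d_k = k.shape
    d_v = v.shape[1]
    state = np.zeros((d_k, d_v))  # fixed-size recurrent memory
    outputs = np.zeros((T, d_v))
    for t in range(T):
        # A scalar gate is broadcast to every channel; a vector gate is kept as is.
        gate = np.broadcast_to(np.asarray(decay[t], dtype=float), (d_k,))
        state = gate[:, None] * state + np.outer(k[t], v[t])
        outputs[t] = state.T @ q[t]
    return outputs


rng = np.random.default_rng(0)
T, d_k, d_v = 6, 4, 3
q = rng.normal(size=(T, d_k))
k = rng.normal(size=(T, d_k))
v = rng.normal(size=(T, d_v))

head_wise = np.full(T, 0.9)                           # one rate for all channels
channel_wise = np.tile([1.0, 0.9, 0.5, 0.0], (T, 1))  # independent rate per channel

out_head = gated_linear_attention(q, k, v, head_wise)
out_channel = gated_linear_attention(q, k, v, channel_wise)
print(out_head.shape, out_channel.shape)
```

In the channel-wise run, the first key channel never forgets while the last one keeps only the current step, something a single scalar gate cannot express. Setting every channel to the same rate recovers the head-wise behaviour exactly.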

Every Activation Boosted: Scaling General Reasoner to 1 Trillion Open Language Foundation [149.0]
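Ling 2.0 is a series of reasoning-oriented language foundations built on the principle that every activation boosts reasoning capability. Ling 2.0 emphasizes high sparsity, cross-scale consistency, and efficiency, guided by empirical scaling laws. The series includes three non-thinking models: Ling-mini-2.0, Ling-flash-2.0, and Ling-1T.
Reference translation of the paper (metadata) (Sat, 25 Oct 2025 01:51:37 GMT)
Unlike Ring-1T, which focuses on long reasoning, this series focuses on general reasoning and instruction-following capability.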
GitHub – inclusionAI/Ling-V2: Ling-V2 is a MoE LLM provided and open-sourced by InclusionAI.

Scaling Latent Reasoning via Looped Language Models [109.6]
We present and open-source Ouro, a family of pre-trained Looped Language Models (LoopLM). Ouro builds reasoning into the pre-training phase through (i) iterative computation in latent space, (ii) an entropy-regularized objective for learned depth allocation, and (iii) scaling to 7.7T tokens.
Reference translation of the paper (metadata) (Wed, 29 Oct 2025 17:45:42 GMT)
A report on building models with the Looped Language Model (LoopLM) architecture. "we introduced Ouro, a family of Looped Language Models that demonstrate exceptional parameter efficiency by integrating iterative computation and adaptive depth directly into pre-training on 7.7T tokens. Our 1.4B and 2.6B models consistently match or exceed the performance of 4B and 8B standard transformers, showcasing a 2-3× efficiency gain." The efficiency is very high.
Ouro: Looped Language Models

Parallel Loop Transformer for Efficient Test-Time Computation Scaling [34.8]
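Large language models (LLMs) are powerful, but often too slow and costly for real-world use during inference. Looped transformers save parameters by reusing the same weights across multiple computational steps. However, the loops run one after another, so inference latency and memory requirements grow with each added loop.
Reference translation of the paper (metadata) (Tue, 28 Oct 2025 15:35:50 GMT)
This one is the parallel variant, the Parallel Loop Transformer (PLT).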
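The weight-reuse idea behind looped models can be sketched in a few lines. This is a toy under stated assumptions, not Ouro or PLT: `block` is a hypothetical residual layer standing in for a full transformer layer, and the sizes are arbitrary. It shows why looping saves parameters (one weight matrix regardless of depth) and why latency still grows (the loop iterations run one after another).

```python
import numpy as np


def block(h, weight):
    """One toy residual block; stands in for a full transformer layer."""
    return h + np.tanh(h @ weight)


def looped_forward(x, shared_weight, n_loops):
    """Apply the same block n_loops times, reusing one set of weights."""
    h = x
    for _ in range(n_loops):  # sequential: each pass waits for the previous one
        h = block(h, shared_weight)
    return h


def stacked_forward(x, weights):
    """Conventional stack: every layer has its own weights."""
    h = x
    for w in weights:
        h = block(h, w)
    return h


rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(2, d))
shared = rng.normal(size=(d, d)) * 0.1
distinct = [rng.normal(size=(d, d)) * 0.1 for _ in range(3)]

looped_params = shared.size                     # stays at d*d for any loop count
stacked_params = sum(w.size for w in distinct)  # grows with the number of layers
print(looped_params, stacked_params)
```

A looped model with three passes computes exactly what a three-layer stack with tied weights would, so the compute per token is the same as the deeper stack while the parameter count is that of a single layer.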

Emu3.5: Native Multimodal Models are World Learners [65.9]
Emu3.5 is a large-scale multimodal world model that natively predicts the next state across vision and language. Emu3.5 is pre-trained end-to-end with a unified next-token prediction objective on a corpus of interleaved vision-language data. It exhibits generalizable world-modeling abilities that enable consistent world exploration and open-world embodied manipulation.
Reference translation of the paper (metadata) (Thu, 30 Oct 2025 15:11:16 GMT)
The latest in the Emu series (Emu3: Next-Token Prediction is All You Need – Introducing the Latest arXiv Papers). "Emu3.5 further exhibits generalizable world-modeling abilities encompassing world exploration and embodied manipulation, enabling controllable interaction, free-form navigation, and dynamic scene simulation across both real and imagined environments. We carefully evaluate these new capabilities and demonstrate clear superiority of Emu3.5, a single 32B unified model, over the closed-source Gemini 2.5 Flash Image [91]."
emu.world/pages/web/landingPage, GitHub – baaivision/Emu3.5: Native Multimodal Models are World Learners
