December 26, 2025 – Introducing the Latest arXiv Papers

LongVie 2: Multimodal Controllable Ultra-Long Video World Model

LongVie 2: Multimodal Controllable Ultra-Long Video World Model [94.9]
LongVie 2 is an end-to-end autoregressive framework trained in three stages. It achieves state-of-the-art performance in long-range controllability, temporal coherence, and visual fidelity.
Reference translation of the paper (metadata) (Mon, 15 Dec 2025 17:59:58 GMT)
The paper states: "LongVie 2 achieves state-of-the-art performance in controllable long video generation and can autoregressively synthesize high-quality videos lasting up to 3–5 minutes, marking a significant step toward video world modeling."
The project site is LongVie 2.

The Role of Risk Modeling in Advanced AI Risk Management

The Role of Risk Modeling in Advanced AI Risk Management [33.4]
Rapidly advancing artificial intelligence (AI) systems pose novel, uncertain, and potentially catastrophic risks. Managing these risks requires a mature risk management infrastructure founded on rigorous risk modeling. We argue that advanced AI governance should adopt a similar dual approach, and that verifiable, provably safe AI architectures are urgently needed.
Reference translation of the paper (metadata) (Tue, 09 Dec 2025 15:37:33 GMT)
The paper states: "We conceptualize AI risk modeling as the tight integration of (i) scenario building— causal mapping from hazards to harms—and (ii) risk estimation—quantifying the likelihood and severity of each pathway. We review classical techniques such as Fault and Event Tree Analyses, FMEA/FMECA, STPA and Bayesian networks, and show how they can be adapted to advanced AI." The examples from other fields and the analysis methods it covers are a useful reference.
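To make the "risk estimation" half of that quote concrete, here is a minimal sketch of how a fault-tree style calculation quantifies the likelihood and severity of one hazard-to-harm pathway. It is not taken from the paper: the function names, the independence assumption between events, and every probability and severity value are hypothetical and chosen only for illustration.

```python
from math import prod


def or_gate(probabilities):
    """Probability that at least one of several independent events occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


def and_gate(probabilities):
    """Probability that all of several independent events occur."""
    return prod(probabilities)


def pathway_risk(hazard_probs, safeguard_failure_prob, severity):
    """Return (likelihood, expected severity) for one hazard-to-harm pathway.

    Assumed structure (hypothetical): harm occurs when any initiating
    hazard occurs AND a single safeguard fails. Events are treated as
    independent, which real systems often violate.
    """
    p_hazard = or_gate(hazard_probs)
    p_harm = and_gate([p_hazard, safeguard_failure_prob])
    return p_harm, p_harm * severity


if __name__ == "__main__":
    # All numbers below are made up for illustration.
    likelihood, expected = pathway_risk([0.02, 0.01], 0.1, 1000.0)
    print(f"likelihood={likelihood:.5f}, expected severity={expected:.2f}")
```

In this sketch, scenario building corresponds to choosing the gate structure, and risk estimation corresponds to evaluating it. Techniques such as Bayesian networks relax the independence assumption made here.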

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality

The FACTS Leaderboard: A Comprehensive Benchmark for Large Language Model Factuality [70.5]
The FACTS Leaderboard is an online leaderboard suite that comprehensively evaluates the ability of language models to generate factually accurate text. The suite provides a holistic measure of factuality by aggregating model performance across four distinct sub-leaderboards.
Reference translation of the paper (metadata) (Thu, 11 Dec 2025 16:35:14 GMT)
The paper describes the benchmark as follows: "The FACTS Leaderboard introduced here is designed to address this need by providing a holistic evaluation suite. It aggregates performance across four specialized sub-leaderboards, each targeting a distinct dimension of factuality."
- FACTS Multimodal tests a model’s ability to combine visual grounding with world knowledge to answer questions about an image.
- FACTS Parametric measures the model’s ability to use its internal knowledge accurately in factoid question use-cases.
- FACTS Search evaluates the practical and increasingly common use case of generating factual responses by interacting with a search tool.
- FACTS Grounding v2 is an updated version of FACTS Grounding, which tests grounding to a given document, with improved judges.
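The project site is FACTS Benchmark Suite Leaderboard | Kaggle. Frontier models are, as expected, strong, and the Search result for Gemini 3 Pro preview is impressive. I would like to see evaluation results for the latest models.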
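As a rough illustration of what "aggregating performance across four sub-leaderboards" can look like, the sketch below combines per-leaderboard scores into one number. The unweighted mean is an assumption made here for simplicity, not a formula confirmed by this summary. The dictionary keys, the function name, and the example scores are also hypothetical.

```python
from statistics import mean

# Hypothetical keys for the four sub-leaderboards named in the summary.
SUB_LEADERBOARDS = ("multimodal", "parametric", "search", "grounding_v2")


def aggregate_facts_score(scores):
    """Combine four sub-leaderboard scores into a single number.

    Assumption: an unweighted arithmetic mean. The actual leaderboard
    may aggregate differently.
    """
    missing = [name for name in SUB_LEADERBOARDS if name not in scores]
    if missing:
        raise ValueError(f"missing sub-leaderboard scores: {missing}")
    return mean(scores[name] for name in SUB_LEADERBOARDS)


if __name__ == "__main__":
    # Placeholder scores, not real leaderboard results.
    example = {
        "multimodal": 40.0,
        "parametric": 60.0,
        "search": 80.0,
        "grounding_v2": 20.0,
    }
    print(aggregate_facts_score(example))
```

Requiring all four scores, rather than averaging whatever is present, keeps a model from looking stronger simply because it skipped a sub-leaderboard.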
