「(1) a large-scale dataset, LongVideo-Reason, comprising 52K long video QA pairs with high-quality reasoning annotations across diverse domains such as sports, games, and vlogs; (2) a two-stage training pipeline that extends VLMs with chain-of-thought supervised fine-tuning (CoT-SFT) and reinforcement learning (RL); and (3) a training infrastructure for long video RL, named Multi-modal Reinforcement Sequence Parallelism (MR-SP), which incorporates sequence parallelism and a vLLM-based engine tailored for long video, using cached video embeddings for efficient rollout and prefilling.」を使用しての長い動画を理解するためのフレームワークの提案
「Unlike domains such as math or code reasoning, where structured supervision and benchmarks are readily available [7, 8], long video reasoning requires annotating complex temporal dynamics, goals, spatial relations, and narrative elements—often across minutes or hours of footage」と、コード生成や数学的推論とは異なる難しさがある。