The authors write: "We argue that for video generation to evolve from mere image animation to genuine world modeling (Ha & Schmidhuber, 2018; LeCun, 2022), models must acquire foundational reasoning capabilities akin to human intuitive physics and cognition. Moving beyond superficial fidelity (Huang et al., 2024; Liu et al., 2024b), we propose a formal evaluation framework asking: Can a video model reason about the physical and logical constraints of the content it generates? Drawing on theories of core knowledge and cognitive development (Spelke & Kinzler, 2007; Lake et al., 2017), we posit that robust world simulation rests on five complementary pillars of reasoning:" The five pillars are as follows.