
Intuitive physics understanding emerges from self-supervised pretraining on natural videos

February 17, 2025
Authors: Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
cs.AI

Abstract

We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above-chance performance. This challenges the idea that core knowledge, a set of innate systems to help understand the world, needs to be hardwired to develop an understanding of intuitive physics.
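The abstract does not spell out how the violation-of-expectation comparison is scored, so the following is only a minimal sketch of one common way to do it, not the authors' implementation. It assumes that a model's "surprise" on a video is its mean absolute prediction error in representation space, and that videos come in matched pairs (one physically possible, one impossible). The function names `surprise` and `pairwise_accuracy`, the error metric, and the tie-handling rule are all assumptions made for illustration.

```python
from typing import Sequence


def surprise(
    predicted: Sequence[Sequence[float]],
    observed: Sequence[Sequence[float]],
) -> float:
    """Mean absolute error between predicted and observed representations.

    Each argument is a list of frames; each frame is a list of features.
    The L1 metric is an assumption chosen for this sketch.
    """
    total = 0.0
    count = 0
    for pred_frame, obs_frame in zip(predicted, observed):
        for p, o in zip(pred_frame, obs_frame):
            total += abs(p - o)
            count += 1
    if count == 0:
        raise ValueError("no features to compare")
    return total / count


def pairwise_accuracy(
    surprise_possible: Sequence[float],
    surprise_impossible: Sequence[float],
) -> float:
    """Fraction of matched pairs where the impossible video is more surprising.

    Ties count as half correct, so a model with no preference scores 0.5,
    which is the chance level for this pairwise comparison.
    """
    if len(surprise_possible) != len(surprise_impossible):
        raise ValueError("inputs must contain the same number of videos")
    if len(surprise_possible) == 0:
        raise ValueError("at least one pair is required")
    score = 0.0
    for possible, impossible in zip(surprise_possible, surprise_impossible):
        if impossible > possible:
            score += 1.0
        elif impossible == possible:
            score += 0.5
    return score / len(surprise_possible)
```

Under these assumptions, a model that understands a property such as object permanence should assign higher surprise to the impossible video in most pairs, giving an accuracy well above 0.5, while a model at chance stays near 0.5.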

