
Rethinking Visual Intelligence: Insights from Video Pretraining

October 28, 2025
Authors: Pablo Acuaviva, Aram Davtyan, Mariam Hassan, Sebastian Stapf, Ahmad Rahimi, Alexandre Alahi, Paolo Favaro
cs.AI

Abstract

Large language models (LLMs) have demonstrated that large-scale pretraining enables systems to adapt rapidly to new problems with little supervision in the language domain. This success, however, has not translated as effectively to the visual domain, where models, including LLMs, continue to struggle with compositional understanding, sample efficiency, and general-purpose problem-solving. We investigate Video Diffusion Models (VDMs) as a promising direction for bridging this gap. Pretraining on spatiotemporal data endows these models with strong inductive biases for structure and dynamics, which we hypothesize can support broad task adaptability. To test this, we design a controlled evaluation in which both a pretrained LLM and a pretrained VDM are equipped with lightweight adapters and presented with tasks in their natural modalities. Across benchmarks including ARC-AGI, ConceptARC, visual games, route planning, and cellular automata, VDMs demonstrate higher data efficiency than their language counterparts. Taken together, our results indicate that video pretraining offers inductive biases that support progress toward visual foundation models.
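The abstract states that both the pretrained LLM and the pretrained VDM are "equipped with lightweight adapters" but does not say which adapter family is used. The sketch below is one plausible reading under that assumption: a LoRA-style low-rank adapter wrapped around a frozen pretrained linear layer. The class name LoRALinear and the rank/alpha hyperparameters are illustrative choices, not details taken from the paper.

```python
# Minimal sketch of a LoRA-style lightweight adapter on a frozen pretrained
# layer. The adapter type, rank, and scaling are assumptions for illustration;
# the paper's abstract only says "lightweight adapters".
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen pretrained linear layer plus a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():  # keep pretrained weights fixed
            p.requires_grad = False
        self.down = nn.Linear(base.in_features, rank, bias=False)  # projection A
        self.up = nn.Linear(rank, base.out_features, bias=False)   # projection B
        nn.init.zeros_(self.up.weight)  # adapter starts as a zero (identity) update
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Pretrained path plus scaled low-rank correction
        return self.base(x) + self.scale * self.up(self.down(x))


if __name__ == "__main__":
    # Toy usage: only the two small adapter matrices are trainable,
    # so the pretrained backbone stays untouched during task adaptation.
    pretrained = nn.Linear(512, 512)
    adapted = LoRALinear(pretrained, rank=8)
    x = torch.randn(4, 512)
    print(adapted(x).shape)  # torch.Size([4, 512])
```

In such a setup, the same adaptation budget can be given to both models, which is what makes the data-efficiency comparison across ARC-AGI, ConceptARC, visual games, route planning, and cellular automata a controlled one.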