Video models are zero-shot learners and reasoners
September 24, 2025
Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
cs.AI
Abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have
propelled natural language processing from task-specific models to unified,
generalist foundation models. This transformation emerged from simple
primitives: large, generative models trained on web-scale data. Curiously, the
same primitives apply to today's generative video models. Could video models be
on a trajectory towards general-purpose vision understanding, much like LLMs
developed general-purpose language understanding? We demonstrate that Veo 3 can
solve a broad variety of tasks it wasn't explicitly trained for: segmenting
objects, detecting edges, editing images, understanding physical properties,
recognizing object affordances, simulating tool use, and more. These abilities
to perceive, model, and manipulate the visual world enable early forms of
visual reasoning like maze and symmetry solving. Veo's emergent zero-shot
capabilities indicate that video models are on a path to becoming unified,
generalist vision foundation models.