Video models are zero-shot learners and reasoners
September 24, 2025
Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
cs.AI
Abstract
The remarkable zero-shot capabilities of Large Language Models (LLMs) have
propelled natural language processing from task-specific models to unified,
generalist foundation models. This transformation emerged from simple
primitives: large, generative models trained on web-scale data. Curiously, the
same primitives apply to today's generative video models. Could video models be
on a trajectory towards general-purpose vision understanding, much like LLMs
developed general-purpose language understanding? We demonstrate that Veo 3 can
solve a broad variety of tasks it wasn't explicitly trained for: segmenting
objects, detecting edges, editing images, understanding physical properties,
recognizing object affordances, simulating tool use, and more. These abilities
to perceive, model, and manipulate the visual world enable early forms of
visual reasoning like maze and symmetry solving. Veo's emergent zero-shot
capabilities indicate that video models are on a path to becoming unified,
generalist vision foundation models.