

Video models are zero-shot learners and reasoners

September 24, 2025
Authors: Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, Robert Geirhos
cs.AI

Abstract

The remarkable zero-shot capabilities of Large Language Models (LLMs) have propelled natural language processing from task-specific models to unified, generalist foundation models. This transformation emerged from simple primitives: large, generative models trained on web-scale data. Curiously, the same primitives apply to today's generative video models. Could video models be on a trajectory towards general-purpose vision understanding, much like LLMs developed general-purpose language understanding? We demonstrate that Veo 3 can solve a broad variety of tasks it wasn't explicitly trained for: segmenting objects, detecting edges, editing images, understanding physical properties, recognizing object affordances, simulating tool use, and more. These abilities to perceive, model, and manipulate the visual world enable early forms of visual reasoning like maze and symmetry solving. Veo's emergent zero-shot capabilities indicate that video models are on a path to becoming unified, generalist vision foundation models.