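Which Pre-training Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models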

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. Two major pre-training schemes are now widely used as foundation backbones for this purpose: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides the better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs across three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using lightweight probes, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
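To make the probing setup concrete, the following is a minimal sketch of frozen-feature probing with a naive fusion step. It is not the authors' implementation. It assumes that features have already been extracted from frozen backbones (random arrays stand in for them here), that the lightweight probe is a ridge-regularized linear map, and that "naive fusion" means channel-wise concatenation. The abstract specifies none of these choices, and the names `fit_linear_probe`, `apply_probe`, `naive_fusion`, and `probe_mse` are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, targets, l2=1e-3):
    """Fit a ridge-regularized linear probe in closed form.

    features: (N, D) array of frozen backbone features (never updated).
    targets:  (N, K) array of per-sample regression targets.
    Returns a (D + 1, K) weight matrix whose last row is the bias.
    """
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    reg = l2 * np.eye(x.shape[1])
    reg[-1, -1] = 0.0  # leave the bias term unregularized
    return np.linalg.solve(x.T @ x + reg, x.T @ targets)


def apply_probe(weights, features):
    """Apply a fitted probe to frozen features."""
    x = np.hstack([features, np.ones((features.shape[0], 1))])
    return x @ weights


def naive_fusion(feat_a, feat_b):
    """Assumed form of naive fusion: concatenate along the channel axis."""
    return np.concatenate([feat_a, feat_b], axis=-1)


def probe_mse(features, targets):
    """Fit a probe and report its mean squared error on the same data."""
    weights = fit_linear_probe(features, targets)
    return float(np.mean((apply_probe(weights, features) - targets) ** 2))


# Synthetic stand-ins for features from two frozen model families.
rng = np.random.default_rng(0)
vlm_feats = rng.normal(size=(200, 8))  # placeholder for VLM features
vgm_feats = rng.normal(size=(200, 6))  # placeholder for VGM features

# A toy target that depends on information held by both feature sets.
target = vlm_feats @ rng.normal(size=(8, 1)) + vgm_feats @ rng.normal(size=(6, 1))

for name, feats in [
    ("VLM only", vlm_feats),
    ("VGM only", vgm_feats),
    ("fused", naive_fusion(vlm_feats, vgm_feats)),
]:
    print(f"{name}: probe MSE = {probe_mse(feats, target):.4f}")
```

Because the backbones stay frozen and only the probe is fitted, any difference in probe error reflects what the features already encode rather than what fine-tuning could add. In this toy setup the target is built from both feature sets, so the fused probe fits it best by construction. The example illustrates the mechanics only and says nothing about the paper's actual results.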