Which Pre-training Paradigm Better Serves Spatial Intelligence? An Empirical Comparison of Vision-Language and Video Generation Models

Abstract

Spatial intelligence requires visual representations that capture both semantic objects and geometric structure in the physical world. To support this, two major pre-training schemes are now widely used as foundation backbones: Vision-Language Models (VLMs), which use language supervision to align visual observations with semantic concepts, and Video Generation Models (VGMs), which learn from temporally evolving visual worlds. However, it remains unclear which pre-training scheme provides a better representation substrate for spatial intelligence. In this paper, we present the first systematic frozen-feature probing study of VLMs and VGMs along three representative axes of spatial intelligence: semantic tagging, instance grouping, and 3D geometry prediction. Using a lightweight probe, our framework enables a controlled comparison of what information is already encoded in the frozen representations of the two model families. Experimental results reveal a clear complementarity: VLMs are stronger at semantic tagging and instance grouping, while VGMs provide more accessible signals for dense geometry and camera motion. Moreover, a naive fusion of the two already yields a representation that excels at both geometry and semantics, suggesting a promising direction for building stronger spatial-intelligence backbones by effectively integrating features from both model families. Our code is available at https://github.com/om-ai-lab/Probing-VLM-VGM.
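To make the frozen-feature probing setup concrete, the following is a minimal sketch and not the authors' implementation. It assumes a linear least-squares probe and assumes that "naive fusion" means channel-wise concatenation of the two feature sets; the abstract specifies neither. The feature arrays, their dimensions, and the regression target are synthetic placeholders, and the helper names `fit_linear_probe` and `probe_mse` are hypothetical.

```python
import numpy as np


def fit_linear_probe(features, targets):
    """Fit a least-squares linear probe (with bias) on frozen features."""
    # Append a constant column so the probe can learn a bias term.
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    W, *_ = np.linalg.lstsq(X, targets, rcond=None)
    return W


def probe_mse(features, targets, W):
    """Mean squared error of a fitted probe on the given features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return float(np.mean((X @ W - targets) ** 2))


rng = np.random.default_rng(0)
n_tokens = 256

# Synthetic stand-ins for frozen backbone outputs (assumed dimensions).
vlm_feats = rng.normal(size=(n_tokens, 16))
vgm_feats = rng.normal(size=(n_tokens, 24))

# Synthetic per-token regression target, e.g. a depth value (placeholder).
targets = rng.normal(size=(n_tokens, 1))

# Assumed form of "naive fusion": concatenate along the channel axis.
fused_feats = np.concatenate([vlm_feats, vgm_feats], axis=1)

# Only the probe is fitted; the backbone features are never updated.
errors = {}
for name, feats in {"vlm": vlm_feats, "vgm": vgm_feats, "fused": fused_feats}.items():
    W = fit_linear_probe(feats, targets)
    errors[name] = probe_mse(feats, targets, W)
```

On the fitting data, the fused probe's error cannot exceed that of either single-family probe, because it can always set the other family's weights to zero. A meaningful comparison would therefore evaluate each probe on a held-out split.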