VLM3: Vision Language Models Are Native 3D Learners

Abstract

Vision Language Models (VLMs) enable a single unified model to solve a wide range of vision tasks through prompting, and they have shown promising performance in semantic understanding. 3D understanding, however, still largely relies on expert vision models with complex task-specific designs. The central argument of this work is that VLMs are native 3D learners. Our in-depth, large-scale study shows that 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling are all you need for effective 3D learning. Model architecture changes, large models, heavy data augmentation, and complex losses including the regression formulation, many of which form the foundation of expert vision models, are not actually necessary. Based on these findings, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only improves VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while retaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.
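
The abstract names focal length unification and text-based pixel reference but does not say how either is implemented. The sketch below is only one plausible reading, not the authors' method. It assumes a pinhole camera, where resizing an image by a factor s also scales the focal length in pixels by s, and it assumes pixels are referenced in the prompt as normalized integer coordinates. The function names, the target focal length, the 1000-bin grid, and the "(x, y)" text format are all hypothetical.

```python
from typing import Tuple


def unified_size(width: int, height: int, focal_px: float,
                 target_focal_px: float) -> Tuple[int, int, float]:
    """Return the (width, height, scale) that brings an image to a shared focal length.

    Assumption: pinhole model, so resizing by `scale` multiplies the
    focal length in pixels by the same factor.
    """
    if focal_px <= 0 or target_focal_px <= 0:
        raise ValueError("focal lengths must be positive")
    scale = target_focal_px / focal_px
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


def pixel_to_text(x: float, y: float, width: int, height: int,
                  bins: int = 1000) -> str:
    """Encode a pixel location as plain text usable inside a prompt.

    Assumption: coordinates are normalized to integers in [0, bins - 1];
    the "(x, y)" format is illustrative only.
    """
    xi = min(bins - 1, max(0, int(x / width * bins)))
    yi = min(bins - 1, max(0, int(y / height * bins)))
    return f"({xi}, {yi})"


if __name__ == "__main__":
    # A 1920x1080 image with a 1500 px focal length, unified to 1000 px.
    w, h, s = unified_size(1920, 1080, focal_px=1500.0, target_focal_px=1000.0)
    print(w, h, round(s, 3))
    # Reference the image center in a hypothetical depth query.
    print(f"What is the depth at pixel {pixel_to_text(960, 540, 1920, 1080)}?")
```

Under these assumptions, every image the model sees shares one focal length, and pixel queries and answers stay in ordinary text, so neither the architecture nor the loss needs to change.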