VLM3: Vision Language Models Are Native 3D Learners

Abstract

Vision Language Models (VLMs) allow a single unified model to solve a variety of vision tasks through prompting, and they have shown promising performance in semantic understanding. 3D understanding, however, still relies largely on expert vision models with complex task-specific designs. The central claim of this work is that VLMs are native 3D learners. Our large-scale, in-depth study shows that three ingredients are all that effective 3D learning requires: 1) focal length unification, 2) text-based pixel reference, and 3) data mixture and scaling. Model architecture changes, large models, heavy data augmentation, and complex losses (including the regression formulation), many of which form the foundation of expert vision models, turn out not to be necessary. Based on these findings, we propose VLM3, a scalable method with the simplest design that enables standard VLMs to master diverse 3D tasks. VLM3 not only improves VLM depth estimation accuracy by a large margin (0.84 -> 0.9), but also enables diverse 3D tasks such as pixel correspondence, camera pose estimation, and object-level 3D understanding, matching the accuracy of expert vision models while retaining standard architectures and text-based training. We believe VLM3 opens up a new paradigm for simple and scalable 3D learning.