Enabling Dense Metric Depth Estimation in Vision-Language Models

Abstract

Vision-Language Models (VLMs) excel at 2D tasks such as grounding and captioning, yet remain limited in 3D understanding. A key limitation is their text-only supervision paradigm, which under-constrains fine-grained visual perception and prevents the recovery of dense geometry. Prior methods either distill geometry from external vision models, which accumulates errors, or predict geometry directly through inefficient per-pixel queries or coarse token-level outputs. In this paper, we propose DepthVLM, a simple yet effective framework that turns a single VLM into a native dense geometry predictor while preserving its multimodal capability. By attaching a lightweight depth head to the LLM backbone and training under a unified vision-text supervision paradigm with a two-stage schedule, DepthVLM generates full-resolution depth maps alongside language outputs in a single forward pass. We further introduce a unified indoor-outdoor metric depth benchmark in a VLM-compatible format. Experiments show that DepthVLM significantly outperforms existing VLMs while offering higher inference efficiency, surpasses leading pure vision models, and improves complex 3D spatial reasoning, moving toward a truly unified foundation model. All code and checkpoints will be publicly released.
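
To make the "lightweight depth head in a single forward pass" idea concrete, the following is a minimal sketch and not the DepthVLM implementation; the abstract does not specify the head's design. It assumes that the backbone yields one hidden vector per visual token on a `grid_h` x `grid_w` grid, that a single linear layer followed by a softplus maps each vector to a positive depth value, and that nearest-neighbour upsampling restores pixel resolution. The names `depth_head`, `weights`, `bias`, and `scale` are hypothetical.

```python
import math


def softplus(x):
    # Numerically stable softplus; keeps every predicted depth strictly positive.
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def depth_head(hidden, weights, bias, grid_h, grid_w, scale):
    """Map per-token hidden states to a dense depth map (hypothetical design).

    hidden:  grid_h * grid_w token vectors in row-major order, each of length d
    weights: d floats and bias: one float, i.e. a single linear layer (assumed)
    scale:   integer upsampling factor from the token grid to pixel resolution
    """
    assert len(hidden) == grid_h * grid_w
    # One depth value per visual token, all computed from the same forward pass.
    coarse = [
        softplus(sum(w * h for w, h in zip(weights, token)) + bias)
        for token in hidden
    ]
    # Nearest-neighbour upsampling: each token fills a scale x scale pixel block.
    depth = []
    for y in range(grid_h * scale):
        row = [
            coarse[(y // scale) * grid_w + (x // scale)]
            for x in range(grid_w * scale)
        ]
        depth.append(row)
    return depth


# Toy usage: a 2 x 3 token grid with 2-dimensional hidden states.
hidden = [[0.1 * (i + 1), -0.2] for i in range(6)]
depth = depth_head(hidden, weights=[1.0, 0.5], bias=0.0,
                   grid_h=2, grid_w=3, scale=4)
```

The sketch shows why this route avoids per-pixel querying: every pixel's depth is read off hidden states that the backbone has already computed, so no additional prompts are issued. A practical head would presumably replace the nearest-neighbour step with learned upsampling to avoid blocky output; that choice is likewise an assumption here.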