
Depth Any Panoramas: A Foundation Model for Panoramic Depth Estimation

December 18, 2025
Authors: Xin Lin, Meixi Song, Dizhe Zhang, Wenxuan Lu, Haodong Li, Bo Du, Ming-Hsuan Yang, Truong Nguyen, Lu Qi
cs.AI

Abstract

In this work, we present a panoramic metric depth foundation model that generalizes across diverse scene distances. We explore a data-in-the-loop paradigm from the perspectives of both data construction and framework design. We collect a large-scale dataset by combining public datasets, high-quality synthetic data from our UE5 simulator and text-to-image models, and real panoramic images from the web. To reduce the domain gaps between indoor/outdoor and synthetic/real data, we introduce a three-stage pseudo-label curation pipeline that generates reliable ground-truth depth for unlabeled images. For the model, we adopt DINOv3-Large as the backbone for its strong pre-trained generalization, and introduce a plug-and-play range mask head, sharpness-centric optimization, and geometry-centric optimization to improve robustness to varying distances and enforce geometric consistency across views. Experiments on multiple benchmarks (e.g., Stanford2D3D, Matterport3D, and Deep360) demonstrate strong performance and zero-shot generalization, with particularly robust and stable metric predictions in diverse real-world scenes. The project page can be found at: https://insta360-research-team.github.io/DAP_website/
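The abstract does not spell out how the plug-and-play range mask head works. One plausible reading is that the head predicts a per-pixel soft mask over distance bands, and the final metric depth blends per-band predictions by those masks. The sketch below illustrates that idea only; the function names, two-band split, and band boundaries are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=0):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def range_masked_depth(band_depths, range_logits):
    """Blend per-band depth maps with soft range masks.

    band_depths: (num_bands, H, W) depth prediction for each distance band
    range_logits: (num_bands, H, W) head logits; softmax over bands
                  yields a per-pixel soft assignment to each band
    """
    masks = softmax(range_logits, axis=0)
    return (masks * band_depths).sum(axis=0)

# Toy example: two bands (near/far) on a 2x2 image. The logits strongly
# favour the near band, so the blended depth stays close to 1.5 m.
near = np.full((2, 2), 1.5)    # metres
far = np.full((2, 2), 30.0)
logits = np.stack([np.full((2, 2), 5.0), np.full((2, 2), -5.0)])
depth = range_masked_depth(np.stack([near, far]), logits)
```

Because the masks are soft, such a head can be attached to any per-band depth decoder without retraining it, which matches the "plug-and-play" framing in the abstract.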