MetricAnything: Scaling Metric Depth Pretraining with Noisy Heterogeneous Sources
January 29, 2026
Authors: Baorui Ma, Jiahui Yang, Donglin Di, Xuancheng Zhang, Jianxun Cui, Hao Li, Yan Xie, Wei Chen
cs.AI
Abstract
Scaling has powered recent advances in vision foundation models, yet extending this paradigm to metric depth estimation remains challenging due to heterogeneous sensor noise, camera-dependent biases, and metric ambiguity in noisy cross-source 3D data. We introduce Metric Anything, a simple and scalable pretraining framework that learns metric depth from noisy, diverse 3D sources without manually engineered prompts, camera-specific modeling, or task-specific architectures. Central to our approach is the Sparse Metric Prompt, created by randomly masking depth maps, which serves as a universal interface that decouples spatial reasoning from sensor and camera biases. Using about 20M image-depth pairs spanning reconstructed, captured, and rendered 3D data across 10,000 camera models, we demonstrate, for the first time, a clear scaling trend on the metric depth track. The pretrained model excels at prompt-driven tasks such as depth completion, super-resolution, and radar-camera fusion, while its distilled prompt-free student achieves state-of-the-art results on monocular depth estimation, camera intrinsics recovery, single/multi-view metric 3D reconstruction, and VLA planning. We also show that using the pretrained ViT of Metric Anything as a visual encoder significantly boosts the spatial intelligence of Multimodal Large Language Models. These results show that metric depth estimation can benefit from the same scaling laws that drive modern foundation models, establishing a new path toward scalable and efficient real-world metric perception. We open-source MetricAnything at http://metric-anything.github.io/metric-anything-io/ to support community research.
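The Sparse Metric Prompt described above is built by randomly masking a dense or sensor-derived depth map so that only a small fraction of metric readings survive as anchors. The abstract does not give implementation details, so the sketch below is a hedged illustration: the function name, the `keep_ratio` parameter, and the use of 0 as the "no metric hint" value are all assumptions, not the authors' code.

```python
import numpy as np

def sparse_metric_prompt(depth, keep_ratio=0.01, seed=0):
    """Illustrative sketch of a Sparse Metric Prompt (assumed details).

    depth: HxW metric depth map in meters; 0 marks invalid/missing pixels.
    keep_ratio: fraction of valid pixels retained as sparse metric anchors.
    Returns a map of the same shape where non-kept pixels are zeroed out.
    """
    rng = np.random.default_rng(seed)
    keep = rng.random(depth.shape) < keep_ratio   # random per-pixel mask
    return np.where((depth > 0) & keep, depth, 0.0)

# Toy example: a dense depth map at 3 m everywhere.
depth = np.full((480, 640), 3.0)
prompt = sparse_metric_prompt(depth, keep_ratio=0.01)
surviving = (prompt > 0).mean()   # roughly keep_ratio of pixels remain
```

Because the mask is independent of the sensor pattern and camera model, the same interface can represent LiDAR points, radar returns, or downsampled depth, which is what lets one pretrained model serve depth completion, super-resolution, and radar-camera fusion.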