
DINOv3

August 13, 2025
Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
cs.AI

Abstract

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. By not being tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images -- using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size by careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.
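The abstract names Gram anchoring as the remedy for dense feature maps degrading over long training schedules. As a rough illustration only (not the authors' implementation, which the report itself describes), one way such a regularizer could be structured is to match the patch-similarity (Gram) matrix of the current model's dense features against that of a frozen earlier checkpoint, penalizing drift in pairwise patch relations rather than in the features themselves. All names and shapes below are hypothetical:

```python
import numpy as np

def gram_anchoring_loss(student_feats, anchor_feats):
    """Illustrative sketch of a Gram-matrix anchoring penalty.

    student_feats: (batch, num_patches, dim) dense features from the
        model currently being trained.
    anchor_feats:  same shape, from a frozen earlier checkpoint whose
        dense features are still well-behaved.
    Returns a scalar: mean squared mismatch between the two models'
    patch-similarity (Gram) matrices.
    """
    # L2-normalize each patch feature so the Gram entries are cosine
    # similarities between patches.
    s = student_feats / np.linalg.norm(student_feats, axis=-1, keepdims=True)
    a = anchor_feats / np.linalg.norm(anchor_feats, axis=-1, keepdims=True)

    # (batch, num_patches, num_patches) pairwise patch similarities.
    gram_s = s @ s.transpose(0, 2, 1)
    gram_a = a @ a.transpose(0, 2, 1)

    # Frobenius-style penalty on the difference of the two Gram matrices.
    return float(np.mean((gram_s - gram_a) ** 2))
```

In this sketch the loss is zero when the current model preserves all pairwise patch similarities of the anchor, so it constrains the structure of the dense feature map while leaving the features free to rotate or rescale globally.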