DINOv3
August 13, 2025
Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
cs.AI
Abstract
Self-supervised learning holds the promise of eliminating the need for manual
data annotation, enabling models to scale effortlessly to massive datasets and
larger architectures. By not being tailored to specific tasks or domains, this
training paradigm has the potential to learn visual representations from
diverse sources, ranging from natural to aerial images, using a single
algorithm. This technical report introduces DINOv3, a major milestone toward
realizing this vision by leveraging simple yet effective strategies. First, we
capitalize on the benefits of scaling both dataset and model size through careful data
preparation, design, and optimization. Second, we introduce a new method called
Gram anchoring, which effectively addresses the known yet unsolved issue of
dense feature maps degrading during long training schedules. Finally, we apply
post-hoc strategies that further enhance our models' flexibility with respect
to resolution, model size, and alignment with text. As a result, we present a
versatile vision foundation model that outperforms the specialized state of the
art across a broad range of settings, without fine-tuning. DINOv3 produces
high-quality dense features that achieve outstanding performance on various
vision tasks, significantly surpassing previous self- and weakly-supervised
foundation models. We also share the DINOv3 suite of vision models, designed to
advance the state of the art on a wide spectrum of tasks and data by providing
scalable solutions for diverse resource constraints and deployment scenarios.