DINOv3
August 13, 2025
Authors: Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julien Mairal, Hervé Jégou, Patrick Labatut, Piotr Bojanowski
cs.AI
Abstract
Self-supervised learning holds the promise of eliminating the need for manual
data annotation, enabling models to scale effortlessly to massive datasets and
larger architectures. By not being tailored to specific tasks or domains, this
training paradigm has the potential to learn visual representations from
diverse sources, ranging from natural to aerial images, using a single
algorithm. This technical report introduces DINOv3, a major milestone toward
realizing this vision by leveraging simple yet effective strategies. First, we
capitalize on the benefits of scaling both dataset and model size through careful data
preparation, design, and optimization. Second, we introduce a new method called
Gram anchoring, which effectively addresses the known yet unsolved issue of
dense feature maps degrading during long training schedules. Finally, we apply
post-hoc strategies that further enhance our models' flexibility with respect
to resolution, model size, and alignment with text. As a result, we present a
versatile vision foundation model that outperforms the specialized state of the
art across a broad range of settings, without fine-tuning. DINOv3 produces
high-quality dense features that achieve outstanding performance on various
vision tasks, significantly surpassing previous self- and weakly-supervised
foundation models. We also share the DINOv3 suite of vision models, designed to
advance the state of the art on a wide spectrum of tasks and data by providing
scalable solutions for diverse resource constraints and deployment scenarios.