DINOv3

Abstract

Self-supervised learning holds the promise of eliminating the need for manual data annotation, enabling models to scale effortlessly to massive datasets and larger architectures. Because it is not tailored to specific tasks or domains, this training paradigm has the potential to learn visual representations from diverse sources, ranging from natural to aerial images, using a single algorithm. This technical report introduces DINOv3, a major milestone toward realizing this vision by leveraging simple yet effective strategies. First, we leverage the benefit of scaling both dataset and model size through careful data preparation, design, and optimization. Second, we introduce a new method called Gram anchoring, which effectively addresses the known yet unsolved issue of dense feature maps degrading during long training schedules. Finally, we apply post-hoc strategies that further enhance our models' flexibility with respect to resolution, model size, and alignment with text. As a result, we present a versatile vision foundation model that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models. We also share the DINOv3 suite of vision models, designed to advance the state of the art on a wide spectrum of tasks and data by providing scalable solutions for diverse resource constraints and deployment scenarios.