DIP: Unsupervised Dense In-Context Post-training of Visual Representations

Abstract

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
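As a rough illustration of what an in-context dense prediction task can look like, the sketch below labels each query patch by softmax-weighted voting over a small set of labeled support patches. It is not taken from the DIP code base: the helper name `in_context_predict`, the temperature value, and the toy 2-D features are assumptions made purely for illustration. In a real setting the features would come from the vision encoder's patch outputs.

```python
import math


def in_context_predict(query_feats, support_feats, support_labels,
                       num_classes, temperature=0.1):
    """Label each query patch by attention-weighted voting over support patches.

    Hypothetical helper for illustration only; not the DIP implementation.
    """
    def normalize(v):
        # L2-normalize so the dot product below is a cosine similarity.
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        return [x / norm for x in v]

    support = [normalize(s) for s in support_feats]
    predictions = []
    for q in query_feats:
        q = normalize(q)
        # Temperature-scaled cosine similarity to every support patch.
        sims = [sum(a * b for a, b in zip(q, s)) / temperature for s in support]
        # Numerically stable softmax over the support patches.
        peak = max(sims)
        weights = [math.exp(s - peak) for s in sims]
        total = sum(weights)
        # Accumulate the soft votes for each class.
        votes = [0.0] * num_classes
        for w, label in zip(weights, support_labels):
            votes[label] += w / total
        predictions.append(max(range(num_classes), key=votes.__getitem__))
    return predictions


# Toy example (assumed data): two labeled support patches per class.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
support_labels = [0, 0, 1, 1]
query_feats = [[0.8, 0.2], [0.2, 0.8]]
preds = in_context_predict(query_feats, support_feats, support_labels,
                           num_classes=2)
```

In this toy case the first query patch lies closest to the class-0 support patches and the second to the class-1 patches, so the predicted labels follow the nearest support examples without any parameter update.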
