DIP: Unsupervised Dense In-Context Post-training of Visual Representations
June 23, 2025
Authors: Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
cs.AI
Abstract
We introduce DIP, a novel unsupervised post-training method designed to
enhance dense image representations in large-scale pretrained vision encoders
for in-context scene understanding. Unlike prior approaches that rely on
complex self-distillation architectures, our method trains the vision encoder
using pseudo-tasks that explicitly simulate downstream in-context scenarios,
inspired by meta-learning principles. To enable post-training on unlabeled
data, we propose an automatic mechanism for generating in-context tasks that
combines a pretrained diffusion model and the vision encoder itself. DIP is
simple, unsupervised, and computationally efficient, requiring less than 9
hours on a single A100 GPU. By learning dense representations through pseudo
in-context tasks, it achieves strong performance across a wide variety of
downstream real-world in-context scene understanding tasks. It outperforms both
the initial vision encoder and prior methods, offering a practical and
effective solution for improving dense representations. The code is available
at https://github.com/sirkosophia/DIP
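
The abstract does not spell out how an in-context scene understanding task is solved at inference time. One common formulation, assumed here purely for illustration and not confirmed as DIP's exact procedure, labels each patch of a query image by comparing its feature with labeled patch features from a few support images and taking a softmax-weighted vote. The sketch below uses plain Python lists for readability; the function names, the `temperature` parameter, and the toy 2-D features are all hypothetical, and a real pipeline would operate on encoder outputs as GPU tensors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-8)


def in_context_predict(query_feats, support_feats, support_labels,
                       num_classes, temperature=0.1):
    """Label each query patch by softmax-weighted voting over support patches.

    Hypothetical helper: the voting rule and temperature are assumptions,
    not taken from the DIP paper or repository.
    """
    predictions = []
    for q in query_feats:
        # Temperature-scaled similarity of this query patch to every support patch.
        sims = [cosine(q, s) / temperature for s in support_feats]
        # Numerically stable softmax over the support patches.
        max_sim = max(sims)
        weights = [math.exp(x - max_sim) for x in sims]
        total = sum(weights)
        # Accumulate the attention mass received by each class.
        class_scores = [0.0] * num_classes
        for w, label in zip(weights, support_labels):
            class_scores[label] += w / total
        predictions.append(class_scores.index(max(class_scores)))
    return predictions


# Toy example: three labeled support patches, two unlabeled query patches.
support_feats = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
support_labels = [0, 0, 1]
query_feats = [[0.8, 0.2], [0.1, 0.9]]
predicted = in_context_predict(query_feats, support_feats, support_labels, num_classes=2)
```

Under this formulation the encoder is never fine-tuned on the downstream labels; prediction quality depends only on how well its dense features separate classes, which is the property the post-training described above aims to improve.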