

DIP: Unsupervised Dense In-Context Post-training of Visual Representations

June 23, 2025
Authors: Sophia Sirko-Galouchenko, Spyros Gidaris, Antonin Vobecky, Andrei Bursuc, Nicolas Thome
cs.AI

Abstract

We introduce DIP, a novel unsupervised post-training method designed to enhance dense image representations in large-scale pretrained vision encoders for in-context scene understanding. Unlike prior approaches that rely on complex self-distillation architectures, our method trains the vision encoder using pseudo-tasks that explicitly simulate downstream in-context scenarios, inspired by meta-learning principles. To enable post-training on unlabeled data, we propose an automatic mechanism for generating in-context tasks that combines a pretrained diffusion model and the vision encoder itself. DIP is simple, unsupervised, and computationally efficient, requiring less than 9 hours on a single A100 GPU. By learning dense representations through pseudo in-context tasks, it achieves strong performance across a wide variety of downstream real-world in-context scene understanding tasks. It outperforms both the initial vision encoder and prior methods, offering a practical and effective solution for improving dense representations. Code available here: https://github.com/sirkosophia/DIP
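The downstream in-context scene understanding setting the abstract refers to can be pictured as dense nearest-neighbor retrieval: each patch of a query image is labeled by matching its feature against the patch features of a small annotated support set. The sketch below is a minimal illustration of that evaluation setup under these assumptions; the function name `dense_in_context_predict`, the soft nearest-neighbor voting, and the toy tensor shapes are illustrative and not taken from the DIP codebase.

```python
# Hypothetical sketch: dense in-context prediction via soft nearest-neighbor
# retrieval over support patch features (not the authors' implementation).
import torch
import torch.nn.functional as F


def dense_in_context_predict(query_feats, support_feats, support_labels, temperature=0.1):
    """Predict per-patch class probabilities for a query image.

    query_feats:    (Nq, D)  patch features of the query image
    support_feats:  (Ns, D)  patch features pooled from the support images
    support_labels: (Ns, C)  one-hot (or soft) patch labels of the support set
    """
    # Cosine similarity between every query patch and every support patch.
    q = F.normalize(query_feats, dim=-1)
    s = F.normalize(support_feats, dim=-1)
    sim = q @ s.t() / temperature          # (Nq, Ns)

    # Soft nearest-neighbor voting: attend over support patches,
    # then take a weighted average of their labels.
    attn = sim.softmax(dim=-1)             # (Nq, Ns)
    return attn @ support_labels           # (Nq, C)


if __name__ == "__main__":
    D, C = 768, 5
    query_feats = torch.randn(196, D)        # e.g. a 14x14 ViT patch grid
    support_feats = torch.randn(4 * 196, D)  # 4 support images
    support_labels = F.one_hot(torch.randint(0, C, (4 * 196,)), C).float()
    probs = dense_in_context_predict(query_feats, support_feats, support_labels)
    print(probs.shape)  # torch.Size([196, 5])
```

In a setup like this, no weights are updated at test time, so downstream performance depends directly on how well the encoder's dense features separate semantic regions, which is the property DIP's pseudo in-context post-training is designed to improve.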