GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

March 25, 2025
Authors: Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
cs.AI

Abstract

The synergy between generative and discriminative models is receiving growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels at high-level semantics, it struggles to perceive fine-grained visual details. Generally, to enhance representations, generative models take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while filtering out irrelevant information. To explore the critical factors, we delve into three aspects: (1) Conditioning mechanisms: We find that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that using only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observe that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers with desirable outcomes, validating the versatility of our method. Through these in-depth explorations, we arrive at an effective method, namely GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAI CLIP. The enhanced CLIP can further be plugged into multimodal large language models for better vision-centric performance. All models and code are publicly available.
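To make the conditioning idea concrete, below is a minimal PyTorch-style sketch, not the authors' released implementation: the names `LightweightDenoiser` and `reconstruction_loss`, the network sizes, and the simplified (unscheduled) noising step are illustrative assumptions. It only shows the core design choice the abstract argues for, conditioning a small denoiser on CLIP's global visual token alone so that local tokens cannot shortcut the reconstruction.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LightweightDenoiser(nn.Module):
    """Hypothetical lightweight denoiser: a couple of transformer blocks that
    predict the noise added to image latents, conditioned ONLY on the global
    visual token from the CLIP encoder (no local patch tokens)."""
    def __init__(self, latent_dim=4, cond_dim=1024, width=512, depth=2, heads=8):
        super().__init__()
        self.cond_proj = nn.Linear(cond_dim, width)
        self.latent_proj = nn.Linear(latent_dim, width)
        self.blocks = nn.ModuleList(
            [nn.TransformerEncoderLayer(width, heads, batch_first=True)
             for _ in range(depth)]
        )
        self.head = nn.Linear(width, latent_dim)

    def forward(self, noisy_latents, global_token):
        # noisy_latents: (B, N, latent_dim)  flattened noisy image latents
        # global_token:  (B, cond_dim)       CLIP's global visual token
        cond = self.cond_proj(global_token).unsqueeze(1)           # (B, 1, width)
        x = torch.cat([cond, self.latent_proj(noisy_latents)], dim=1)
        for blk in self.blocks:
            x = blk(x)
        return self.head(x[:, 1:])                                 # predicted noise

def reconstruction_loss(denoiser, clip_features, clean_latents):
    """One regression step of the sketched objective: add Gaussian noise to the
    latents and ask the denoiser to recover it from the global token alone."""
    global_token = clip_features[:, 0]      # use only the global (CLS) token
    noise = torch.randn_like(clean_latents)
    noisy = clean_latents + noise           # simplified noising; a real diffusion
                                            # schedule would scale by the timestep
    pred = denoiser(noisy, global_token)
    return F.mse_loss(pred, noise)

# Example usage with hypothetical shapes:
#   clip_features = clip_vision_tower(images)   # (B, 1 + num_patches, 1024)
#   loss = reconstruction_loss(denoiser, clip_features, vae_latents)
#   loss.backward()  # gradients flow back into the (trainable) CLIP encoder
```

The point of restricting the condition to the global token, per the abstract, is to keep the reconstruction hard enough that the regression loss pushes fine-grained visual knowledge into CLIP's representation rather than letting the denoiser copy details from local tokens.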
