GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

March 25, 2025
Authors: Shijie Ma, Yuying Ge, Teng Wang, Yuxin Guo, Yixiao Ge, Ying Shan
cs.AI

Abstract
The synergy between generative and discriminative models is receiving growing attention. While discriminative Contrastive Language-Image Pre-Training (CLIP) excels at high-level semantics, it struggles to perceive fine-grained visual details. To enhance representations, generative models typically take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To identify the critical factors, we delve into three aspects: (1) Conditioning mechanisms: we find that even a small number of local tokens drastically reduces the difficulty of reconstruction, leading to collapsed training; we thus conclude that conditioning only on global visual tokens is the most effective strategy. (2) Denoising configurations: we observe that end-to-end training introduces extraneous information; to address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge, and we further show that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: we explore both continuous and discrete denoisers, both with desirable outcomes, validating the versatility of our method. Through these in-depth explorations, we arrive at an effective method, GenHancer, which consistently outperforms prior art on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAI's CLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All models and code are publicly available.
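To make the recipe in the abstract concrete, below is a minimal PyTorch sketch of the two ideas it highlights: conditioning a lightweight denoiser on only CLIP's global visual token, and splitting training into two stages (denoiser first, then CLIP). All module names, tensor shapes, and the noise-prediction loss here are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class LightweightDenoiser(nn.Module):
    """Tiny transformer denoiser conditioned on a single global token."""
    def __init__(self, dim: int = 768, depth: int = 2, heads: int = 8):
        super().__init__()
        self.cond_proj = nn.Linear(dim, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, dim)

    def forward(self, noisy_tokens, global_token):
        # Broadcast the single global condition over all noisy tokens.
        # Local tokens are deliberately withheld: per the abstract, even a
        # few of them make reconstruction trivial and collapse training.
        cond = self.cond_proj(global_token).unsqueeze(1)      # (B, 1, D)
        return self.head(self.blocks(noisy_tokens + cond))    # noise estimate

def train_step(clip_visual, denoiser, optimizer, clean_tokens, stage):
    # Assumed two-stage schedule: stage 1 trains only the denoiser with
    # CLIP frozen; stage 2 unfreezes CLIP so reconstruction gradients
    # refine its visual representation.
    clip_visual.requires_grad_(stage == 2)
    feats = clip_visual(clean_tokens)            # hypothetical encoder: (B, N, D)
    global_tok = feats[:, 0]                     # global token as the sole condition
    noise = torch.randn_like(clean_tokens)
    t = torch.rand(clean_tokens.size(0), 1, 1)   # per-sample noise level
    noisy = (1.0 - t) * clean_tokens + t * noise # simple linear corruption
    loss = nn.functional.mse_loss(denoiser(noisy, global_tok), noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with a stand-in encoder; in stage 2, the CLIP parameters
# would also be added to the optimizer.
clip_stub = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(768, 8, batch_first=True), num_layers=2)
denoiser = LightweightDenoiser()
opt = torch.optim.AdamW(denoiser.parameters(), lr=1e-4)
tokens = torch.randn(4, 197, 768)                # e.g., a ViT-B/16 token grid + [CLS]
print(train_step(clip_stub, denoiser, opt, tokens, stage=1))
```

The design choice this sketch illustrates is the one the abstract argues for: withholding local tokens from the condition keeps the reconstruction task hard enough that fine-grained knowledge must flow into CLIP's global representation rather than leaking through a shortcut.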
