GenHancer: Imperfect Generative Models are Secretly Strong Vision-Centric Enhancers

Abstract

The synergy between generative and discriminative models is receiving growing attention. While the discriminative Contrastive Language-Image Pre-Training (CLIP) model excels at high-level semantics, it struggles to perceive fine-grained visual details. To enhance its representations, generative models typically take CLIP's visual features as conditions for reconstruction. However, the underlying principle remains underexplored. In this work, we empirically find that visually perfect generations are not always optimal for representation enhancement. The essence lies in effectively extracting fine-grained knowledge from generative models while mitigating irrelevant information. To identify the critical factors, we delve into three aspects: (1) Conditioning mechanisms: We find that even a small number of local tokens can drastically reduce the difficulty of reconstruction, leading to collapsed training. We thus conclude that using only global visual tokens as conditions is the most effective strategy. (2) Denoising configurations: We observe that end-to-end training introduces extraneous information. To address this, we propose a two-stage training strategy that prioritizes learning useful visual knowledge. Additionally, we demonstrate that lightweight denoisers can yield remarkable improvements. (3) Generation paradigms: We explore both continuous and discrete denoisers and obtain desirable outcomes with each, validating the versatility of our method. Through these in-depth explorations, we arrive at an effective method, namely GenHancer, which consistently outperforms prior methods on the MMVP-VLM benchmark, e.g., by 6.0% on OpenAICLIP. The enhanced CLIP can be further plugged into multimodal large language models for better vision-centric performance. All models and code are publicly available.
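As a rough illustration of the conditioning finding in point (1), the minimal Python sketch below selects which visual tokens would be passed to a generative model as conditions. It is not taken from the released code: the function name `select_condition_tokens`, the `num_local` parameter, and the layout of the token sequence (global class token at index 0, followed by local patch tokens) are all assumptions made for illustration.

```python
from typing import List

Token = List[float]


def select_condition_tokens(visual_tokens: List[Token], num_local: int = 0) -> List[Token]:
    """Pick the tokens used as conditions for reconstruction.

    Assumption (hypothetical layout): visual_tokens[0] is the global
    (class) token and visual_tokens[1:] are local patch tokens.
    With num_local=0, only the global token is returned, which is the
    setting the abstract reports as most effective.
    """
    if not visual_tokens:
        raise ValueError("expected at least the global token")
    if num_local < 0:
        raise ValueError("num_local must be non-negative")
    global_token = visual_tokens[0]
    # Optionally append the first `num_local` patch tokens; the abstract
    # reports that even a few of these make reconstruction too easy.
    local_tokens = visual_tokens[1:1 + num_local]
    return [global_token] + local_tokens


# Toy sequence: one global token plus four patch tokens, dimension 2.
tokens = [[0.1, 0.2], [1.0, 1.1], [2.0, 2.1], [3.0, 3.1], [4.0, 4.1]]
global_only = select_condition_tokens(tokens)
with_two_local = select_condition_tokens(tokens, num_local=2)
```

Here `global_only` holds a single token, so the generative model in this sketch would receive only a coarse summary of the image and would need to rely on fine-grained information encoded in that one vector.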
