Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Abstract

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to enable high-performance, guidance-free AR visual generation, and we analyze its theoretical connection with guided sampling methods. Unlike guidance methods, which alter the sampling process to reach the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same target distribution. Experimental results show that CCA significantly improves the guidance-free performance of all tested models with just one epoch of fine-tuning on the pretraining dataset (about 1% of pretraining epochs), reaching performance on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost in half. Moreover, by adjusting training parameters, CCA can trade off sample diversity against fidelity, much as CFG does. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
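To make the sampling-cost claim concrete, the following is a minimal sketch of guided versus guidance-free next-token sampling. It is not the authors' implementation. It assumes one common CFG convention in logit space, `cond + scale * (cond - uncond)`, where `scale = 0` means no guidance. The names `CountingModel`, `cfg_logits`, and `next_token_distribution` are hypothetical, and the toy model returns fixed logits purely so that forward passes can be counted.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_logits(cond_logits, uncond_logits, scale):
    """Assumed CFG convention: push conditional logits away from unconditional ones.

    scale == 0 returns the conditional logits unchanged (guidance-free).
    """
    return [c + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


class CountingModel:
    """Hypothetical stand-in for an AR model; counts forward passes."""

    def __init__(self, vocab_size=4):
        self.vocab_size = vocab_size
        self.calls = 0

    def logits(self, prefix, condition):
        self.calls += 1
        if condition is None:
            # Unconditional pass: flat logits in this toy setup.
            return [0.0] * self.vocab_size
        # Conditional pass: mildly favors the token equal to the condition.
        return [1.0 if t == condition else 0.0 for t in range(self.vocab_size)]


def next_token_distribution(model, prefix, condition, scale):
    """Return next-token probabilities, using CFG only when scale > 0."""
    cond = model.logits(prefix, condition)
    if scale == 0.0:
        # Guidance-free: a single forward pass per token.
        return softmax(cond)
    # Guided: a second, unconditional forward pass is required.
    uncond = model.logits(prefix, None)
    return softmax(cfg_logits(cond, uncond, scale))


if __name__ == "__main__":
    model = CountingModel()
    free = next_token_distribution(model, [], condition=2, scale=0.0)
    passes_free = model.calls
    guided = next_token_distribution(model, [], condition=2, scale=3.0)
    passes_guided = model.calls - passes_free
    print("forward passes (guidance-free, guided):", passes_free, passes_guided)
    print("p(token=2) guidance-free vs guided:", free[2], guided[2])
```

In this sketch the guided path needs two forward passes per token and the guidance-free path needs one, which is the source of the halved sampling cost. A model fine-tuned with CCA is meant to be sampled through the single-pass path while targeting the sharper distribution that guidance would otherwise produce.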