Toward Guidance-Free AR Visual Generation via Condition Contrastive Alignment

Abstract

Classifier-Free Guidance (CFG) is a critical technique for enhancing the sample quality of visual generative models. However, in autoregressive (AR) multi-modal generation, CFG introduces design inconsistencies between language and visual content, contradicting the design philosophy of unifying different modalities for visual AR. Motivated by language model alignment methods, we propose Condition Contrastive Alignment (CCA) to facilitate guidance-free AR visual generation with high performance and analyze its theoretical connection with guided sampling methods. Unlike guidance methods that alter the sampling process to achieve the ideal sampling distribution, CCA directly fine-tunes pretrained models to fit the same distribution target. Experimental results show that CCA can significantly enhance the guidance-free performance of all tested models with just one epoch of fine-tuning (about 1% of pretraining epochs) on the pretraining dataset, on par with guided sampling methods. This largely removes the need for guided sampling in AR visual generation and cuts the sampling cost by half. Moreover, by adjusting training parameters, CCA can achieve trade-offs between sample diversity and fidelity similar to CFG. This experimentally confirms the strong theoretical connection between language-targeted alignment and visual-targeted guidance methods, unifying two previously independent research fields. Code and model weights: https://github.com/thu-ml/CCA.
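To make the sampling-cost claim concrete, the following is a minimal sketch of how guided and guidance-free next-token sampling differ for an AR model. It is not taken from the CCA repository: `toy_model`, `next_token_probs`, and the 3-token vocabulary are hypothetical, and the logit combination `uncond + w * (cond - uncond)` is one common CFG convention assumed here (other papers parameterize the scale differently). The point it illustrates is that guided sampling needs a conditional and an unconditional forward pass per token, while a guidance-free model needs only one.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cfg_logits(cond, uncond, w):
    # Assumed convention: w = 1 recovers the conditional logits,
    # w > 1 pushes the result further away from the unconditional logits.
    return [u + w * (c - u) for c, u in zip(cond, uncond)]


def next_token_probs(model, prefix, condition, w=None):
    """Return (next-token probabilities, number of forward passes used)."""
    cond = model(prefix, condition)
    if w is None:
        # Guidance-free sampling: a single conditional forward pass.
        return softmax(cond), 1
    # Guided sampling: an extra unconditional forward pass per token.
    uncond = model(prefix, None)
    return softmax(cfg_logits(cond, uncond, w)), 2


def toy_model(prefix, condition):
    # Hypothetical stand-in for an AR visual model over a 3-token vocabulary.
    return [2.0, 1.0, 0.0] if condition is not None else [1.0, 1.0, 1.0]


plain_probs, plain_passes = next_token_probs(toy_model, [], "cat")
guided_probs, guided_passes = next_token_probs(toy_model, [], "cat", w=2.0)
print(plain_passes, guided_passes)
```

In this toy setting, guidance sharpens the distribution toward the token favored by the condition at the price of a second model call. The abstract's claim is that fine-tuning with CCA lets the single-pass path reach comparable quality.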
