DoraCycle: Domain-Oriented Adaptation of Unified Generative Model in Multimodal Cycles
March 5, 2025
Authors: Rui Zhao, Weijia Mao, Mike Zheng Shou
cs.AI
Abstract
Adapting generative models to specific domains presents an effective solution
for satisfying specialized requirements. However, adapting to some complex
domains remains challenging, especially when these domains require substantial
paired data to capture the targeted distributions. Since unpaired data from a
single modality, such as vision or language, is more readily available, we
utilize the bidirectional mappings between vision and language learned by the
unified generative model to enable training on unpaired data for domain
adaptation. Specifically, we propose DoraCycle, which integrates two multimodal
cycles: text-to-image-to-text and image-to-text-to-image. The model is
optimized through cross-entropy loss computed at the cycle endpoints, where
both endpoints share the same modality. This facilitates self-evolution of the
model without reliance on annotated text-image pairs. Experimental results
demonstrate that for tasks independent of paired knowledge, such as
stylization, DoraCycle can effectively adapt the unified model using only
unpaired data. For tasks involving new paired knowledge, such as specific
identities, a combination of a small set of paired image-text examples and
larger-scale unpaired data is sufficient for effective domain-oriented
adaptation. The code will be released at https://github.com/showlab/DoraCycle.
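
The abstract describes optimizing the unified model with a cross-entropy loss at the endpoints of two multimodal cycles, text-to-image-to-text and image-to-text-to-image, so that both endpoints lie in the same modality and no annotated pairs are required. Below is a minimal sketch of that objective, assuming a token-based unified model with hypothetical methods `generate_image`, `generate_text`, `text_logits`, and `image_logits`; the names, signatures, and the treatment of gradients through the discrete generation step are illustrative assumptions, not the released implementation.

```python
# Sketch of the two cycle losses over unpaired data (hypothetical model API).
import torch
import torch.nn.functional as F

def text_to_image_to_text_loss(model, text_tokens):
    # Cycle T -> I -> T: generate image tokens from text, caption them back,
    # and supervise the reconstructed text with cross-entropy against the
    # original text tokens (both cycle endpoints are in the text modality).
    image_tokens = model.generate_image(text_tokens)   # unpaired text only
    text_logits = model.text_logits(image_tokens)      # (B, L, text_vocab)
    return F.cross_entropy(text_logits.transpose(1, 2), text_tokens)

def image_to_text_to_image_loss(model, image_tokens):
    # Cycle I -> T -> I: caption the image, regenerate image tokens from the
    # caption, and apply cross-entropy between reconstructed and original
    # image-token sequences (both endpoints are in the image modality).
    text_tokens = model.generate_text(image_tokens)    # unpaired images only
    image_logits = model.image_logits(text_tokens)     # (B, N, visual_vocab)
    return F.cross_entropy(image_logits.transpose(1, 2), image_tokens)

def cycle_step(model, unpaired_texts, unpaired_images):
    # One optimization step combining both cycles. How gradients propagate
    # through the discrete generation in the middle of each cycle is omitted
    # here and left to the paper's actual training procedure.
    return (text_to_image_to_text_loss(model, unpaired_texts)
            + image_to_text_to_image_loss(model, unpaired_images))
```

For tasks that introduce new paired knowledge (e.g., specific identities), the abstract indicates this unpaired cycle objective would be combined with a standard supervised loss on a small set of paired image-text examples.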