Controlling Text-to-Image Diffusion by Orthogonal Finetuning
June 12, 2023
Authors: Zeju Qiu, Weiyang Liu, Haiwen Feng, Yuxuan Xue, Yao Feng, Zhen Liu, Dan Zhang, Adrian Weller, Bernhard Schölkopf
cs.AI
Abstract
Large text-to-image diffusion models have impressive capabilities in
generating photorealistic images from text prompts. How to effectively guide or
control these powerful models to perform different downstream tasks becomes an
important open problem. To tackle this challenge, we introduce a principled
finetuning method, Orthogonal Finetuning (OFT), for adapting text-to-image
diffusion models to downstream tasks. Unlike existing methods, OFT can provably
preserve hyperspherical energy which characterizes the pairwise neuron
relationship on the unit hypersphere. We find that this property is crucial for
preserving the semantic generation ability of text-to-image diffusion models.
To improve finetuning stability, we further propose Constrained Orthogonal
Finetuning (COFT), which imposes an additional radius constraint on the
hypersphere. Specifically, we consider two important text-to-image finetuning
tasks: subject-driven generation where the goal is to generate subject-specific
images given a few images of a subject and a text prompt, and controllable
generation where the goal is to enable the model to take in additional control
signals. We empirically show that our OFT framework outperforms existing
methods in generation quality and convergence speed.
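
To make the abstract's two key ideas concrete, here is a minimal NumPy sketch (not the authors' implementation; the toy shapes and variable names are illustrative) of (i) the hyperspherical energy of a layer, i.e., the sum of inverse pairwise distances between unit-normalized neurons, and (ii) an OFT-style update that multiplies the pretrained weight by an orthogonal matrix parameterized via a Cayley transform, which leaves that energy unchanged:

```python
import numpy as np

def hyperspherical_energy(W):
    # Treat each column of W as a neuron and project it onto the unit
    # hypersphere; the energy sums inverse pairwise distances, so it
    # characterizes the pairwise relationships between neuron directions.
    Wn = W / np.linalg.norm(W, axis=0, keepdims=True)
    n = Wn.shape[1]
    energy = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            energy += 1.0 / np.linalg.norm(Wn[:, i] - Wn[:, j])
    return energy

def cayley_orthogonal(S):
    # Cayley transform: map a matrix S, through its skew-symmetric part A,
    # to an orthogonal matrix R = (I + A)^{-1} (I - A).
    A = 0.5 * (S - S.T)
    I = np.eye(A.shape[0])
    return np.linalg.solve(I + A, I - A)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 8))           # pretrained weight: 8 neurons in R^16 (toy size)
S = 0.01 * rng.standard_normal((16, 16))   # trainable parameter (learned during finetuning)
R = cayley_orthogonal(S)
W_ft = R @ W                               # OFT-style update: rotate all neurons jointly

# The orthogonal rotation preserves pairwise neuron angles and norms,
# hence the hyperspherical energy.
print(np.allclose(hyperspherical_energy(W), hyperspherical_energy(W_ft)))
```

Because R rotates all neurons jointly, their pairwise angles on the unit hypersphere are unchanged, which is why the energy is provably preserved; only the skew-symmetric parameter behind R is trained during finetuning.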