

Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

November 17, 2023
Authors: Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan
cs.AI

Abstract

We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune to improve prompt alignment and style alignment respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel fine-tuning method called Style Tailoring, which jointly fits the content and style distribution and achieves best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for stickers generation.
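The abstract does not give implementation details, but the core idea of Style Tailoring — jointly fitting the content and style distributions instead of finetuning on them sequentially — can be sketched as choosing the supervision source per training step based on the sampled diffusion timestep. The timestep split, the constants, and the mapping of noisier steps to content supervision are assumptions for illustration, not details confirmed by this abstract:

```python
import random

# Hypothetical sketch: each training step samples a diffusion timestep t
# and supervises the denoiser with either the HITL Alignment (content)
# set or the Style set, depending on t. T_MAX, T_SPLIT, and the
# high-noise -> content / low-noise -> style mapping are assumptions.

T_MAX = 1000    # assumed total diffusion timesteps
T_SPLIT = 500   # assumed threshold separating the two regimes


def pick_batch_source(t, content_set, style_set):
    """Choose which dataset supervises this denoising step.

    Noisier (early) steps are assumed to shape overall content and
    layout, while less-noisy (late) steps refine fine-grained style.
    """
    return content_set if t >= T_SPLIT else style_set


def training_step(content_set, style_set):
    """One step of the assumed joint-fitting loop (loss elided)."""
    t = random.randrange(T_MAX)
    batch_source = pick_batch_source(t, content_set, style_set)
    # ... noise a batch from batch_source at level t, predict the noise,
    # and take a gradient step on the usual LDM objective ...
    return t, batch_source
```

Under this reading, a single model sees both distributions within one finetuning run, which is what lets it trade off prompt alignment and style alignment jointly rather than overwriting one with the other.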