Text-to-Sticker: Style Tailoring Latent Diffusion Models for Human Expression

November 17, 2023
Authors: Animesh Sinha, Bo Sun, Anmol Kalia, Arantxa Casanova, Elliot Blanchard, David Yan, Winnie Zhang, Tony Nelli, Jiahui Chen, Hardik Shah, Licheng Yu, Mitesh Kumar Singh, Ankit Ramchandani, Maziar Sanjabi, Sonal Gupta, Amy Bearman, Dhruv Mahajan
cs.AI

Abstract

We introduce Style Tailoring, a recipe to finetune Latent Diffusion Models (LDMs) in a distinct domain with high visual quality, prompt alignment and scene diversity. We choose sticker image generation as the target domain, as the images significantly differ from the photorealistic samples typically generated by large-scale LDMs. We start with a competent text-to-image model, like Emu, and show that relying on prompt engineering with a photorealistic model to generate stickers leads to poor prompt alignment and scene diversity. To overcome these drawbacks, we first finetune Emu on millions of sticker-like images collected using weak supervision to elicit diversity. Next, we curate human-in-the-loop (HITL) Alignment and Style datasets from model generations, and finetune on them to improve prompt alignment and style alignment, respectively. Sequential finetuning on these datasets poses a tradeoff between better style alignment and prompt alignment gains. To address this tradeoff, we propose a novel finetuning method called Style Tailoring, which jointly fits the content and style distributions and achieves the best tradeoff. Evaluation results show our method improves visual quality by 14%, prompt alignment by 16.2% and scene diversity by 15.3%, compared to prompt engineering the base Emu model for sticker generation.
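The abstract describes jointly fitting a content distribution (the HITL Alignment set) and a style distribution (the Style set) in a single finetuning stage, rather than finetuning on them sequentially. Below is a minimal, hypothetical sketch of what such a joint training step could look like with diffusers-style components (`unet`, `vae`, `text_encoder`, `scheduler` are assumed to be standard LDM modules, not the authors' code). The specific choice of routing the denoising loss by timestep, with the boundary `t_split`, is an illustrative assumption and is not stated in the abstract; the only grounded point is that both datasets supervise the same model in one objective.

```python
# Sketch of a joint content + style finetuning step for an LDM.
# Assumption: each batch contributes a standard epsilon-prediction loss,
# with timesteps restricted to a (hypothetical) range per dataset so that
# content and style are fit jointly rather than sequentially.
import torch
import torch.nn.functional as F


def style_tailoring_step(unet, vae, text_encoder, scheduler,
                         content_batch, style_batch, t_split, device="cuda"):
    """One illustrative training step that jointly fits content and style.

    content_batch / style_batch: dicts with "pixel_values" and "input_ids".
    t_split: hypothetical timestep boundary separating the two loss terms.
    """
    losses = []
    ranges = (
        (content_batch, (t_split, scheduler.config.num_train_timesteps)),
        (style_batch, (0, t_split)),
    )
    for batch, (t_lo, t_hi) in ranges:
        # Encode images into the latent space of the frozen VAE.
        latents = vae.encode(batch["pixel_values"].to(device)).latent_dist.sample()
        latents = latents * vae.config.scaling_factor

        # Sample noise and timesteps restricted to this dataset's range.
        noise = torch.randn_like(latents)
        t = torch.randint(t_lo, t_hi, (latents.shape[0],), device=device)
        noisy_latents = scheduler.add_noise(latents, noise, t)

        # Text conditioning from the (frozen) text encoder.
        cond = text_encoder(batch["input_ids"].to(device))[0]

        # Standard epsilon-prediction objective on the UNet.
        pred = unet(noisy_latents, t, encoder_hidden_states=cond).sample
        losses.append(F.mse_loss(pred, noise))

    # Both terms are optimized together in a single backward pass,
    # in contrast to sequential finetuning on one dataset after the other.
    return sum(losses)
```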