OmniBooth: Learning Latent Control for Image Synthesis with Multi-modal Instruction

Abstract

We present OmniBooth, an image generation framework that enables spatial control with instance-level multi-modal customization. The instruction for each instance can be given either as a text prompt or as an image reference. Given a set of user-defined masks and the associated text or image guidance, our objective is to generate an image in which multiple objects are positioned at the specified coordinates and their attributes are precisely aligned with the corresponding guidance. This approach significantly expands the scope of text-to-image generation and makes it more versatile and practical in terms of controllability. Our core contribution is the proposed latent control signal, a high-dimensional spatial feature that provides a unified representation for seamlessly integrating spatial, textual, and image conditions. The text condition extends ControlNet to provide instance-level open-vocabulary generation. The image condition further enables fine-grained control with personalized identity. In practice, our method gives users more flexibility in controllable generation, since they can choose text or image conditions for each instance as needed. Furthermore, thorough experiments demonstrate improved image synthesis fidelity and alignment across different tasks and datasets. Project page: https://len-li.github.io/omnibooth-web/
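
To make the idea of a unified spatial representation more concrete, the following is a minimal sketch of how per-instance guidance might be combined into a single feature map. It is an illustration only, not the method described in the paper: it assumes each instance's text or image guidance has already been encoded into a fixed-length vector by some encoder, and that the vector is simply written into every pixel of that instance's mask. The function name `paint_latent_control`, the array shapes, and the overwrite rule for overlapping masks are all hypothetical.

```python
import numpy as np


def paint_latent_control(masks, embeddings):
    """Build an (H, W, C) feature map from per-instance masks and embeddings.

    Hypothetical helper, not taken from the paper.
    masks:      list of (H, W) boolean arrays, one per instance.
    embeddings: list of (C,) float vectors, one per instance, assumed to come
                from a text or image encoder that is not shown here.
    """
    if not masks or len(masks) != len(embeddings):
        raise ValueError("need at least one mask and exactly one embedding per mask")

    height, width = masks[0].shape
    channels = embeddings[0].shape[0]
    control = np.zeros((height, width, channels), dtype=np.float32)

    for mask, embedding in zip(masks, embeddings):
        # Write the instance vector into every pixel covered by its mask.
        # Assumption: where masks overlap, later instances overwrite earlier ones.
        control[mask] = embedding

    return control


# Two hypothetical instances on a 4x4 canvas with 3-channel embeddings.
mask_a = np.zeros((4, 4), dtype=bool)
mask_a[:2, :2] = True
mask_b = np.zeros((4, 4), dtype=bool)
mask_b[2:, 2:] = True
emb_a = np.array([1.0, 0.0, 0.0], dtype=np.float32)
emb_b = np.array([0.0, 1.0, 0.0], dtype=np.float32)

control = paint_latent_control([mask_a, mask_b], [emb_a, emb_b])
```

In this sketch, pixels outside every mask stay at zero, so the feature map carries both where each instance is and what it should look like. A map of this kind could then be passed to a conditioning branch such as a ControlNet-style network, though how OmniBooth actually encodes and injects the signal is not specified in the abstract.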