InstanceGen: Image Generation with Instance-level Instructions

Abstract

Despite rapid advances in the capabilities of generative models, pretrained text-to-image models still struggle to capture the semantics conveyed by complex prompts that combine multiple objects and instance-level attributes. Consequently, there is growing interest in integrating additional structural constraints, typically in the form of coarse bounding boxes, to better guide the generation process in such challenging cases. In this work, we take the idea of structural guidance a step further by observing that contemporary image generation models can directly provide a plausible fine-grained structural initialization. We propose a technique that couples this image-based structural guidance with LLM-based instance-level instructions, yielding output images that adhere to all parts of the text prompt, including object counts, instance-level attributes, and spatial relations between instances.