Controllable Text-to-Image Generation with GPT-4
May 29, 2023
Authors: Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang
cs.AI
Abstract
Current text-to-image generation models often struggle to follow textual instructions, especially those requiring spatial reasoning. On the other hand, Large Language Models (LLMs), such as GPT-4, have shown remarkable precision in generating code snippets that sketch out text inputs graphically, e.g., via TikZ. In this work, we introduce Control-GPT to guide diffusion-based text-to-image pipelines with programmatic sketches generated by GPT-4, enhancing their ability to follow instructions. Control-GPT works by querying GPT-4 to write TikZ code; the generated sketches are then used as references, alongside the text instructions, for diffusion models (e.g., ControlNet) to generate photo-realistic images.
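To make the querying step concrete, here is a minimal Python sketch assuming the OpenAI chat API (legacy `openai` SDK, pre-1.0). The prompt wording and the helper name `sketch_from_instruction` are illustrative assumptions, not the paper's actual prompt, and the returned TikZ source would still need to be compiled and rasterized into a sketch image.

```python
# Minimal sketch of the GPT-4 query step; prompt text is an assumption,
# not the paper's actual prompt.
import openai

def sketch_from_instruction(instruction: str) -> str:
    """Ask GPT-4 for TikZ code laying out the scene described by `instruction`."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
            {"role": "system",
             "content": "You write TikZ code that sketches scenes as simple shapes."},
            {"role": "user",
             "content": f"Draw a TikZ sketch of the spatial layout of: {instruction}"},
        ],
    )
    # TikZ source; compile and rasterize it before using it as a control image.
    return response.choices[0].message.content

# e.g., sketch_from_instruction("a cat sitting to the left of a red ball")
```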
One major challenge in training our pipeline is the lack of a dataset containing aligned text, images, and sketches. We address this by converting instance masks in existing datasets into polygons that mimic the sketches used at test time.
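One plausible realization of this mask-to-polygon conversion, assuming binary instance masks (e.g., COCO-style) and OpenCV, is sketched below; the routine and the tolerance `eps_frac` are assumptions, as the abstract does not specify the exact procedure.

```python
# Hedged sketch: approximate a binary instance mask with a polygon
# so training-time inputs resemble the hand-drawn-style test-time sketches.
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, eps_frac: float = 0.01) -> np.ndarray:
    """Return an (N, 2) array of polygon vertices approximating `mask`."""
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    contour = max(contours, key=cv2.contourArea)       # keep the largest region
    epsilon = eps_frac * cv2.arcLength(contour, True)  # tolerance ~1% of perimeter
    polygon = cv2.approxPolyDP(contour, epsilon, True)
    return polygon.reshape(-1, 2)
```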
As a result, Control-GPT greatly boosts the controllability of image generation. It establishes a new state of the art in spatial arrangement and object positioning, and it improves users' control over object positions, sizes, etc., nearly doubling the accuracy of prior models. Our work, as a first attempt, shows the potential of employing LLMs to enhance performance on computer vision tasks.
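For concreteness, the generation stage described above (a rasterized sketch plus the text instruction fed to a ControlNet-conditioned diffusion model) might look like the following sketch using Hugging Face `diffusers` with stock checkpoints. The checkpoint names, file name, and prompt here are placeholders; the paper trains its own pipeline rather than using these off-the-shelf weights.

```python
# Hedged sketch of sketch-conditioned generation with off-the-shelf weights;
# the paper's pipeline is trained on its own aligned text/image/sketch data.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from PIL import Image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

sketch = Image.open("layout_sketch.png").convert("RGB")  # rasterized GPT-4 sketch
image = pipe("a cat sitting to the left of a red ball", image=sketch).images[0]
image.save("generated.png")
```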