Controllable Text-to-Image Generation with GPT-4
May 29, 2023
Authors: Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang
cs.AI
Abstract
Current text-to-image generation models often struggle to follow textual
instructions, especially those requiring spatial reasoning. Large Language
Models (LLMs) such as GPT-4, on the other hand, have shown remarkable
precision in generating code snippets that sketch out text inputs
graphically, e.g., via TikZ. In this work, we introduce Control-GPT, which
guides diffusion-based text-to-image pipelines with programmatic sketches
generated by GPT-4, enhancing their ability to follow instructions.
Control-GPT works by querying GPT-4 to write TikZ code; the generated
sketches are then used as references, alongside the text instructions, for
diffusion models (e.g., ControlNet) to generate photo-realistic images.
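To make the two stages concrete, here is a minimal sketch of what such a pipeline might look like, assuming the public `openai` and `diffusers` Python libraries. The prompt wording, the model checkpoints, and the file-based handoff of the compiled TikZ sketch are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of the two-stage pipeline, assuming the public `openai`
# and `diffusers` libraries. Prompt wording and checkpoints are illustrative
# assumptions, not the paper's exact configuration.
from openai import OpenAI
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

client = OpenAI()

def tikz_for_instruction(instruction: str) -> str:
    """Stage 1: ask GPT-4 to express the spatial layout of an instruction as TikZ code."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write TikZ code that sketches the layout of: {instruction}",
        }],
    )
    return response.choices[0].message.content

prompt = "a cat sitting to the left of a red box"
tikz_source = tikz_for_instruction(prompt)

# The TikZ source is compiled to a raster sketch outside this script
# (e.g., pdflatex + pdftoppm); here we assume it was saved to a file.
sketch = Image.open("tikz_sketch.png")

# Stage 2: condition a diffusion model on the sketch plus the text prompt.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)
image = pipe(prompt, image=sketch).images[0]
image.save("output.png")
```

Scribble-style conditioning is a natural fit here, since the compiled TikZ output is essentially a line drawing of the intended layout.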
One major challenge in training our pipeline is the lack of a dataset
containing aligned text, images, and sketches. We address this by converting
instance masks in existing datasets into polygons that mimic the sketches
used at test time.
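One plausible implementation of this mask-to-polygon conversion uses OpenCV's contour utilities; the approximation tolerance and the outline-style rendering below are assumptions, as the abstract does not specify them.

```python
# A plausible mask-to-polygon conversion using OpenCV. The approximation
# tolerance (epsilon_frac) and outline rendering are assumptions.
import cv2
import numpy as np

def mask_to_polygon_sketch(mask: np.ndarray, epsilon_frac: float = 0.01) -> np.ndarray:
    """Convert a binary instance mask (H, W) into a polygon outline on a blank
    canvas, mimicking the programmatic sketches used at test time."""
    canvas = np.zeros_like(mask, dtype=np.uint8)
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        # Simplify each contour to a polygon; epsilon controls the coarseness.
        epsilon = epsilon_frac * cv2.arcLength(contour, closed=True)
        polygon = cv2.approxPolyDP(contour, epsilon, closed=True)
        cv2.polylines(canvas, [polygon], isClosed=True, color=255, thickness=2)
    return canvas
```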
As a result, Control-GPT greatly boosts the controllability of image generation.
It establishes a new state of the art in spatial arrangement and object
positioning, enhancing users' control over object positions, sizes, etc.,
and nearly doubling the accuracy of prior models. As a first attempt, our
work shows the potential of employing LLMs to enhance performance on
computer vision tasks.