Controllable Text-to-Image Generation with GPT-4
May 29, 2023
Authors: Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang
cs.AI
Abstract
Current text-to-image generation models often struggle to follow textual
instructions, especially those requiring spatial reasoning. Large Language
Models (LLMs) such as GPT-4, on the other hand, have shown remarkable
precision in generating code snippets that sketch out text inputs
graphically, e.g., via TikZ. In this work, we introduce Control-GPT, which
guides diffusion-based text-to-image pipelines with programmatic sketches
generated by GPT-4, enhancing their ability to follow instructions.
Control-GPT works by querying GPT-4 to write TikZ code; the generated
sketches are then used as references, alongside the text instructions, for
diffusion models (e.g., ControlNet) to generate photo-realistic images.
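To make the two stages concrete, here is a minimal sketch of what such a pipeline might look like, assuming the public `openai` and `diffusers` Python libraries. The prompt wording, the model checkpoints, and the file-based handoff of the compiled TikZ sketch are illustrative assumptions, not the paper's exact setup.

```python
# A minimal sketch of the two-stage pipeline, assuming the public `openai`
# and `diffusers` libraries. Prompt wording and checkpoints are illustrative
# assumptions, not the paper's exact configuration.
from openai import OpenAI
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline

client = OpenAI()

def tikz_for_instruction(instruction: str) -> str:
    """Stage 1: ask GPT-4 to express the spatial layout of an instruction as TikZ code."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write TikZ code that sketches the layout of: {instruction}",
        }],
    )
    return response.choices[0].message.content

prompt = "a cat sitting to the left of a red box"
tikz_source = tikz_for_instruction(prompt)

# The TikZ source is compiled to a raster sketch outside this script
# (e.g., pdflatex + pdftoppm); here we assume it was saved to a file.
sketch = Image.open("tikz_sketch.png")

# Stage 2: condition a diffusion model on the sketch plus the text prompt.
controlnet = ControlNetModel.from_pretrained("lllyasviel/sd-controlnet-scribble")
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet
)
image = pipe(prompt, image=sketch).images[0]
image.save("output.png")
```

Scribble-style conditioning is a natural fit here, since the compiled TikZ output is essentially a line drawing of the intended layout.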
One major challenge in training our pipeline is the lack of a dataset
containing aligned text, images, and sketches. We address this by converting
instance masks in existing datasets into polygons that mimic the sketches
used at test time.
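One plausible implementation of this mask-to-polygon conversion uses OpenCV's contour utilities; the approximation tolerance and the outline-style rendering below are assumptions, as the abstract does not specify them.

```python
# A plausible mask-to-polygon conversion using OpenCV. The approximation
# tolerance (epsilon_frac) and outline rendering are assumptions.
import cv2
import numpy as np

def mask_to_polygon_sketch(mask: np.ndarray, epsilon_frac: float = 0.01) -> np.ndarray:
    """Convert a binary instance mask (H, W) into a polygon outline on a blank
    canvas, mimicking the programmatic sketches used at test time."""
    canvas = np.zeros_like(mask, dtype=np.uint8)
    contours, _ = cv2.findContours(
        mask.astype(np.uint8), cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE
    )
    for contour in contours:
        # Simplify each contour to a polygon; epsilon controls the coarseness.
        epsilon = epsilon_frac * cv2.arcLength(contour, closed=True)
        polygon = cv2.approxPolyDP(contour, epsilon, closed=True)
        cv2.polylines(canvas, [polygon], isClosed=True, color=255, thickness=2)
    return canvas
```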
As a result, Control-GPT greatly boosts the controllability of image generation.
It establishes a new state of the art in spatial arrangement and object
positioning, enhancing users' control over object positions, sizes, etc.,
and nearly doubling the accuracy of prior models. As a first attempt, our
work shows the potential of employing LLMs to enhance performance on
computer vision tasks.