Controllable Text-to-Image Generation with GPT-4

May 29, 2023
Authors: Tianjun Zhang, Yi Zhang, Vibhav Vineet, Neel Joshi, Xin Wang
cs.AI

Abstract

Current text-to-image generation models often struggle to follow textual instructions, especially those requiring spatial reasoning. Large Language Models (LLMs) such as GPT-4, on the other hand, have shown remarkable precision in generating code snippets that sketch out text inputs graphically, e.g., via TikZ. In this work, we introduce Control-GPT, which guides diffusion-based text-to-image pipelines with programmatic sketches generated by GPT-4, enhancing their instruction-following ability. Control-GPT works by querying GPT-4 to write TikZ code; the generated sketches then serve as references, alongside the text instructions, for diffusion models (e.g., ControlNet) to generate photo-realistic images. One major challenge in training our pipeline is the lack of a dataset containing aligned text, images, and sketches. We address this by converting instance masks in existing datasets into polygons that mimic the sketches used at test time. As a result, Control-GPT greatly boosts the controllability of image generation. It establishes a new state of the art in spatial arrangement and object positioning and strengthens user control over object positions, sizes, etc., nearly doubling the accuracy of prior models. As a first attempt, our work demonstrates the potential of employing LLMs to improve performance on computer vision tasks.
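
To make the pipeline concrete, the sketch below walks through the three stages the abstract describes: query GPT-4 for TikZ code, render that code to a sketch image, and condition a ControlNet-style diffusion model on the sketch. All function names, the model checkpoints, and the rendering toolchain (pdflatex, pdftoppm) are illustrative assumptions, not the paper's released implementation.

```python
# A minimal sketch of the Control-GPT pipeline, assuming the OpenAI Python
# client, HuggingFace diffusers, and a standard LaTeX toolchain. Checkpoint
# names and helper functions are hypothetical stand-ins.
import subprocess
import tempfile
from pathlib import Path

import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image
from openai import OpenAI
from PIL import Image


def tikz_from_instruction(instruction: str) -> str:
    """Step 1: ask GPT-4 to sketch the scene layout as TikZ code."""
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Write TikZ code sketching the layout of: {instruction}. "
                       "Return only a compilable standalone LaTeX document.",
        }],
    )
    # Simplified: assumes the reply is bare LaTeX with no markdown fences.
    return response.choices[0].message.content


def render_tikz(tikz_code: str, out_png: Path) -> Path:
    """Step 2: compile the TikZ source to a PNG sketch via pdflatex + pdftoppm."""
    with tempfile.TemporaryDirectory() as tmp:
        tex = Path(tmp) / "sketch.tex"
        tex.write_text(tikz_code)
        subprocess.run(["pdflatex", "-output-directory", tmp, str(tex)], check=True)
        subprocess.run(["pdftoppm", "-png", "-singlefile",
                        str(tex.with_suffix(".pdf")),
                        str(out_png.with_suffix(""))], check=True)
    return out_png


def generate(instruction: str) -> Image.Image:
    """Step 3: condition a ControlNet diffusion model on the rendered sketch."""
    sketch_path = render_tikz(tikz_from_instruction(instruction), Path("sketch.png"))
    sketch = load_image(str(sketch_path))
    controlnet = ControlNetModel.from_pretrained(
        "lllyasviel/sd-controlnet-scribble", torch_dtype=torch.float16)
    pipe = StableDiffusionControlNetPipeline.from_pretrained(
        "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
        torch_dtype=torch.float16).to("cuda")
    # The text instruction and the sketch jointly condition the generation.
    return pipe(instruction, image=sketch).images[0]
```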
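
The training-data trick of turning instance masks into polygon outlines can also be sketched briefly. The OpenCV-based approach and the `tolerance` parameter below are assumptions chosen for illustration; the paper does not specify its exact conversion code.

```python
# A minimal sketch of converting a binary instance mask into a polygon
# outline that mimics a hand-drawn test-time sketch. The contour tracing
# and simplification tolerance are illustrative assumptions.
import cv2
import numpy as np


def mask_to_polygon_sketch(mask: np.ndarray, tolerance: float = 2.0) -> np.ndarray:
    """Trace each connected region in `mask` and redraw it as a simplified polygon."""
    canvas = np.zeros_like(mask, dtype=np.uint8)
    contours, _ = cv2.findContours(mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    for contour in contours:
        # Douglas-Peucker simplification keeps the coarse, sketch-like shape.
        polygon = cv2.approxPolyDP(contour, epsilon=tolerance, closed=True)
        cv2.polylines(canvas, [polygon], isClosed=True, color=255, thickness=2)
    return canvas
```

Simplifying the contours, rather than tracing masks pixel-perfectly, is what lets training-time inputs resemble the rough TikZ sketches the model sees at test time.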