DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation
December 4, 2025
Authors: Dongzhi Jiang, Renrui Zhang, Haodong Li, Zhuofan Zong, Ziyu Guo, Jun He, Claire Guo, Junyan Ye, Rongyao Fang, Weijia Li, Rui Liu, Hongsheng Li
cs.AI
Abstract
Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning for enhanced text-to-image generation. However, existing approaches remain limited: they either treat the model merely as a standalone generator or rely on abstract textual planning. To address this, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both the textual and visual content of CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structured visual planning and guidance. We then employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections combined with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, a dataset designed to strengthen three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a classifier-free guidance (CFG) strategy specialized for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming both direct generation and other CoT-empowered generation methods.
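The draft-then-verify-then-refine loop described above can be sketched as follows. This is a minimal illustrative mock, not the paper's implementation: the function names (`generate_draft`, `verify`, `refine`, `draco_generate`) are hypothetical, and the unified MLLM is simulated with plain dictionaries and string matching so the control flow is runnable.

```python
# Hedged sketch of the Draft-as-CoT (DraCo) pipeline from the abstract.
# A real system would call a unified MLLM; here each stage is mocked.

def generate_draft(prompt: str) -> dict:
    """Step 1: produce a low-resolution draft image as a visual preview/plan.
    Mock: the 'image' is just the prompt text with the last word dropped,
    simulating a semantic misalignment in the draft."""
    return {"resolution": "low", "content": " ".join(prompt.split()[:-1])}

def verify(draft: dict, prompt: str) -> list:
    """Step 2: use the model's understanding ability to flag semantic
    misalignments between draft and prompt.
    Mock: any prompt word missing from the draft counts as an issue."""
    present = set(draft["content"].split())
    return [w for w in prompt.split() if w not in present]

def refine(draft: dict, issues: list) -> dict:
    """Step 3: selectively correct only the flagged content, then upscale
    (super-resolution is mocked by switching the resolution tag)."""
    content = draft["content"]
    for w in issues:
        content = (content + " " + w).strip()
    return {"resolution": "high", "content": content}

def draco_generate(prompt: str) -> dict:
    draft = generate_draft(prompt)          # visual planning via preview
    issues = verify(draft, prompt)          # verification against prompt
    return refine(draft, issues)            # selective correction + SR

image = draco_generate("a blue cube on a red sphere")
print(image["resolution"], "|", image["content"])
```

In this toy run, the draft omits "sphere", verification flags it, and refinement restores it while upscaling, mirroring how DraCo's interleaved CoT catches misalignments before committing to a full-resolution image.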