DraCo: Draft as CoT for Text-to-Image Preview and Rare Concept Generation

Abstract

Recent unified multimodal large language models (MLLMs) have shown impressive capabilities, incorporating chain-of-thought (CoT) reasoning to enhance text-to-image generation. However, existing approaches remain limited: they either treat the model merely as a standalone generator or rely on abstract textual planning. To address this, we propose Draft-as-CoT (DraCo), a novel interleaved reasoning paradigm that fully leverages both textual and visual content in CoT for better planning and verification. Our method first generates a low-resolution draft image as a preview, providing more concrete and structural visual planning and guidance. Then, we employ the model's inherent understanding capability to verify potential semantic misalignments between the draft and the input prompt, and perform refinement through selective corrections with super-resolution. In this way, our approach addresses two fundamental challenges: the coarse-grained nature of textual planning and the difficulty of generating rare attribute combinations. To support training, we curate DraCo-240K, which aims to enhance three atomic capabilities: general correction, instance manipulation, and layout reorganization. Supported by DraCo-CFG, a specialized classifier-free guidance (CFG) strategy for interleaved reasoning, DraCo achieves substantial gains on GenEval (+8%), Imagine-Bench (+0.91), and GenEval++ (+3%), significantly outperforming direct generation and other CoT-based generation methods.
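The draft, verify, and refine stages described above can be sketched as a simple control flow. This is only an illustrative sketch, not the authors' implementation. Every name in it (`Image`, `draco_generate`, and the toy `draft_fn`, `verify_fn`, `refine_fn` stand-ins) is hypothetical. The draft and final resolutions are placeholder values, since the abstract does not state them. In the actual method a single unified MLLM plays all three roles, whereas here they are separate stub functions operating on text labels instead of pixels.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Image:
    """Hypothetical stand-in for an image: a size plus a text label instead of pixels."""
    width: int
    height: int
    content: str


def draco_generate(
    prompt: str,
    draft_fn: Callable[[str, int], Image],
    verify_fn: Callable[[str, Image], List[str]],
    refine_fn: Callable[[str, Image, List[str], int], Image],
    draft_size: int = 384,    # placeholder resolution (assumption)
    final_size: int = 1024,   # placeholder resolution (assumption)
) -> Tuple[Image, List[str], Image]:
    # 1. Generate a low-resolution draft as a visual preview / plan.
    draft = draft_fn(prompt, draft_size)
    # 2. Use understanding to list semantic mismatches between draft and prompt.
    issues = verify_fn(prompt, draft)
    # 3. Upscale the draft while correcting only the flagged mismatches.
    final = refine_fn(prompt, draft, issues, final_size)
    return draft, issues, final


# Toy stand-ins so the sketch runs end to end.
def toy_draft(prompt: str, size: int) -> Image:
    # Pretend the generator falls back to the common colour for a rare combination.
    return Image(size, size, prompt.replace("blue", "yellow"))


def toy_verify(prompt: str, draft: Image) -> List[str]:
    # Report prompt words that the draft fails to depict.
    depicted = draft.content.split()
    return [word for word in prompt.split() if word not in depicted]


def toy_refine(prompt: str, draft: Image, issues: List[str], size: int) -> Image:
    # Keep the draft content when nothing is wrong; otherwise follow the prompt.
    content = prompt if issues else draft.content
    return Image(size, size, content)


if __name__ == "__main__":
    draft, issues, final = draco_generate("a blue banana", toy_draft, toy_verify, toy_refine)
    print(draft.content, issues, final.content)
```

Passing the three stages in as callables keeps the control flow visible on its own. It should not be read as a claim about how the model is structured internally.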