CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
March 9, 2026
Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu
cs.AI
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods rely largely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene; this code is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines the draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset of structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other CoT-based generation methods. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
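The three-stage pipeline described above (code generation, sandboxed execution to a deterministic draft, then corrective refinement) can be sketched as follows. This is a minimal structural illustration only: every function name and data shape here is a hypothetical stand-in, not the actual CoCo implementation (see the linked repository for that), and the model and renderer stages are replaced by stubs.

```python
# Hedged sketch of the CoCo-style Code-as-CoT pipeline.
# All names and data structures are illustrative assumptions.

def generate_layout_code(prompt: str) -> str:
    """Stage 1 (stub): a unified multimodal model would emit executable
    code specifying the scene's structural layout. Here we return a
    fixed placeholder program for illustration."""
    return (
        "draft = {'objects': ["
        "{'name': 'sign', 'bbox': (10, 10, 200, 80), 'text': 'OPEN'}"
        "]}"
    )

def execute_in_sandbox(code: str) -> dict:
    """Stage 2 (stub): execute the generated layout code in a restricted
    namespace to obtain a deterministic draft. A real system would also
    rasterize this layout into a draft image; stripping builtins is only
    a crude approximation of a proper sandbox."""
    namespace: dict = {}
    exec(code, {"__builtins__": {}}, namespace)
    return namespace["draft"]

def refine_draft(draft: dict, prompt: str) -> dict:
    """Stage 3 (stub): fine-grained image editing would turn the draft
    into a high-fidelity final image. Here we simply tag the draft."""
    return {**draft, "refined": True, "prompt": prompt}

def coco_pipeline(prompt: str) -> dict:
    """Chain the three stages: code as an explicit, verifiable plan."""
    code = generate_layout_code(prompt)
    draft = execute_in_sandbox(code)
    return refine_draft(draft, prompt)

result = coco_pipeline("a shop sign that says OPEN")
print(result["refined"], len(result["objects"]))
```

The key design point the sketch tries to capture is that the intermediate plan is a program, so it can be executed and checked deterministically before any refinement happens, unlike a free-form natural-language plan.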