CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation
March 9, 2026
Authors: Haodong Li, Chunmei Qing, Huanyu Zhang, Dongzhi Jiang, Yihang Zou, Hongbo Peng, Dingming Li, Yuhong Dai, ZePeng Lin, Juanxi Tian, Yi Zhou, Siqi Dai, Jingwei Wu
cs.AI
Abstract
Recent advancements in Unified Multimodal Models (UMMs) have significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods largely rely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene; this code is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines the draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset of roughly ten thousand structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23% over direct generation, while also outperforming other CoT-enhanced generation methods. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
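To make the "code as an explicit, verifiable plan" idea concrete, here is a minimal, purely illustrative sketch of the draft-rendering step: layout "code" describes scene elements, and executing it deterministically yields a draft image (here an SVG string rather than a rasterized image). The names `Rect`, `TextBox`, and `render_draft` are hypothetical and are not part of CoCo's actual interface; the paper's real layout code format may differ entirely.

```python
# Illustrative sketch only: a tiny "layout code" vocabulary whose execution
# produces a deterministic draft. All names here are assumptions, not CoCo's API.
from dataclasses import dataclass
from typing import List, Union

@dataclass
class Rect:
    x: int
    y: int
    w: int
    h: int
    fill: str

@dataclass
class TextBox:
    x: int
    y: int
    text: str

Element = Union[Rect, TextBox]

def render_draft(elements: List[Element], width: int = 640, height: int = 480) -> str:
    """Deterministically render layout elements into an SVG draft string."""
    parts = [f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">']
    for el in elements:
        if isinstance(el, Rect):
            parts.append(
                f'<rect x="{el.x}" y="{el.y}" width="{el.w}" height="{el.h}" fill="{el.fill}"/>'
            )
        else:
            # Dense or exact text is placed literally, so it is trivially verifiable.
            parts.append(f'<text x="{el.x}" y="{el.y}">{el.text}</text>')
    parts.append('</svg>')
    return "\n".join(parts)

# Hypothetical plan for a prompt like "a red box labeled STOP above a blue box":
draft = render_draft([
    Rect(200, 80, 240, 120, "red"),
    TextBox(290, 145, "STOP"),
    Rect(200, 260, 240, 120, "blue"),
])
```

Because the draft is produced by executing code rather than sampling, spatial relations and text content can be checked programmatically before the refinement stage; the same plan always yields the same draft.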