CoCo: Code as CoT for Text-to-Image Preview and Rare Concept Generation

Abstract

Recent progress in Unified Multimodal Models (UMMs) has significantly advanced text-to-image (T2I) generation, particularly through the integration of Chain-of-Thought (CoT) reasoning. However, existing CoT-based T2I methods rely largely on abstract natural-language planning, which lacks the precision required for complex spatial layouts, structured visual elements, and dense textual content. In this work, we propose CoCo (Code-as-CoT), a code-driven reasoning framework that represents the reasoning process as executable code, enabling explicit and verifiable intermediate planning for image generation. Given a text prompt, CoCo first generates executable code that specifies the structural layout of the scene; this code is then executed in a sandboxed environment to render a deterministic draft image. The model subsequently refines this draft through fine-grained image editing to produce the final high-fidelity result. To support this training paradigm, we construct CoCo-10K, a curated dataset of structured draft-final image pairs designed to teach both structured draft construction and corrective visual refinement. Empirical evaluations on StructT2IBench, OneIG-Bench, and LongText-Bench show that CoCo achieves improvements of +68.83%, +54.8%, and +41.23%, respectively, over direct generation, while also outperforming other CoT-empowered generation methods. These results demonstrate that executable code is an effective and reliable reasoning paradigm for precise, controllable, and structured text-to-image generation. The code is available at: https://github.com/micky-li-hd/CoCo
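
To make the "code as intermediate plan" idea concrete, the following is a minimal sketch of the draft stage only, not the authors' implementation. It assumes a hypothetical three-function drawing API (`canvas`, `rect`, `text`) and a hand-written `DRAFT_CODE` string standing in for model-generated layout code; none of these names come from the paper. The restricted `exec` namespace merely illustrates the idea of isolated execution and is not a real security sandbox. The refinement stage (image editing by the model) is not shown.

```python
# Hypothetical stand-in for layout code that a model might emit (assumption).
DRAFT_CODE = '''
canvas(400, 300)
rect(20, 20, 360, 60, "#dddddd")
text(40, 60, "GRAND OPENING")
rect(20, 100, 170, 180, "#99ccff")
rect(210, 100, 170, 180, "#ffcc99")
'''


def render_draft(code: str) -> str:
    """Run layout code against a tiny drawing API and return an SVG string."""
    size = {"w": 0, "h": 0}
    shapes = []

    def canvas(w, h):
        # Record the overall canvas size.
        size["w"], size["h"] = w, h

    def rect(x, y, w, h, fill):
        # Append a filled rectangle at an explicit position.
        shapes.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" fill="{fill}"/>'
        )

    def text(x, y, content):
        # Append a text element; the string is placed exactly as written.
        shapes.append(f'<text x="{x}" y="{y}">{content}</text>')

    # Expose only the three drawing functions and empty the builtins.
    # Illustration only: this does not provide real isolation.
    env = {"__builtins__": {}, "canvas": canvas, "rect": rect, "text": text}
    exec(code, env)

    width, height = size["w"], size["h"]
    body = "".join(shapes)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">{body}</svg>'
    )


if __name__ == "__main__":
    draft = render_draft(DRAFT_CODE)
    # The same code always yields the same draft, so the plan can be checked.
    print(draft == render_draft(DRAFT_CODE))
    print(draft)
```

Because the layout is stated as explicit coordinates and literal strings, the draft can be inspected or checked programmatically before any refinement, which is harder to do with a free-form natural-language plan.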
