Canvas-to-Image: Compositional Image Generation with Multimodal Controls
November 26, 2025
Authors: Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang
cs.AI
Abstract
While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
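To make the key idea concrete, below is a minimal sketch of how heterogeneous controls could be rasterized into a single composite canvas image of the kind the abstract describes. The paper does not specify its encoding; the function `build_canvas`, the canvas resolution, and the drawing conventions (pasted identity crops, red skeleton lines, blue labeled boxes) are illustrative assumptions, not the authors' implementation.

```python
# A hypothetical sketch of composing multimodal controls onto one canvas,
# using Pillow. All names and conventions here are assumptions for
# illustration, not code from the Canvas-to-Image paper.
from PIL import Image, ImageDraw

CANVAS_SIZE = (1024, 1024)  # assumed working resolution

def build_canvas(subjects, poses, layout_boxes):
    """Rasterize subject references, pose skeletons, and layout boxes
    into one RGB canvas image that a diffusion model could condition on."""
    canvas = Image.new("RGB", CANVAS_SIZE, "white")
    draw = ImageDraw.Draw(canvas)

    # Subject references: paste each identity crop into its target region.
    for crop, (x0, y0, x1, y1) in subjects:
        region = crop.resize((x1 - x0, y1 - y0))
        canvas.paste(region, (x0, y0))

    # Pose constraints: draw each skeleton's edges as line segments.
    for keypoints, edges in poses:
        for i, j in edges:
            draw.line([keypoints[i], keypoints[j]], fill="red", width=4)

    # Layout annotations: outline each labeled bounding box.
    for label, (x0, y0, x1, y1) in layout_boxes:
        draw.rectangle([x0, y0, x1, y1], outline="blue", width=3)
        draw.text((x0 + 4, y0 + 4), label, fill="blue")

    return canvas

# Example usage (inputs are hypothetical):
# subject = Image.open("face.png")
# canvas = build_canvas(
#     subjects=[(subject, (100, 100, 356, 356))],
#     poses=[({0: (500, 300), 1: (500, 500)}, [(0, 1)])],
#     layout_boxes=[("dog", (600, 600, 900, 900))],
# )
```

Under this reading, the downstream model simply receives (canvas, text prompt) pairs, and the Multi-Task Canvas Training strategy amounts to mixing canvases built from the different control tasks into one unified training stream rather than training a separate head per control type.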