

Canvas-to-Image: Compositional Image Generation with Multimodal Controls

November 26, 2025
作者: Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang
cs.AI

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.
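To make the core idea concrete, below is a minimal sketch of how heterogeneous controls might be rasterized into one composite canvas image that a diffusion model could condition on. The canvas size, drawing conventions, and all names (`SubjectControl`, `build_canvas`, `generate`) are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of the "composite canvas" idea: heterogeneous controls
# (subject reference crops, layout boxes, pose keypoints, text annotations)
# are rasterized onto a single RGB image. Not taken from the paper's code.
from dataclasses import dataclass, field
from typing import List, Tuple

from PIL import Image, ImageDraw

Box = Tuple[int, int, int, int]   # (left, top, right, bottom) in pixels
Keypoint = Tuple[int, int]        # (x, y) in canvas coordinates


@dataclass
class SubjectControl:
    reference: Image.Image                               # identity reference for one subject
    box: Box                                             # where the subject should appear
    pose: List[Keypoint] = field(default_factory=list)   # optional 2D pose joints
    label: str = ""                                      # optional layout/text annotation


def build_canvas(subjects: List[SubjectControl],
                 size: Tuple[int, int] = (1024, 1024)) -> Image.Image:
    """Rasterize all control signals into a single composite canvas image."""
    canvas = Image.new("RGB", size, color=(255, 255, 255))
    draw = ImageDraw.Draw(canvas)

    for subj in subjects:
        left, top, right, bottom = subj.box

        # 1) Spatial layout + subject reference: paste the resized reference
        #    into its box, so identity and placement share one visual space.
        crop = subj.reference.resize((right - left, bottom - top))
        canvas.paste(crop, (left, top))
        draw.rectangle(subj.box, outline=(255, 0, 0), width=3)

        # 2) Pose constraint: draw the keypoints on top of the box.
        for x, y in subj.pose:
            draw.ellipse((x - 4, y - 4, x + 4, y + 4), fill=(0, 128, 255))

        # 3) Layout annotation: render the text label near the box corner.
        if subj.label:
            draw.text((left + 4, top + 4), subj.label, fill=(0, 0, 0))

    return canvas


# Usage (hypothetical): the canvas plus a text prompt would then be fed to
# the generator, e.g.
#   image = generate(prompt="two friends hiking", canvas=build_canvas(subjects))
```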