MagicQuillV2: Precise and Interactive Image Editing with Layered Visual Cues
December 2, 2025
Authors: Zichen Liu, Yue Yu, Hao Ouyang, Qiuyu Wang, Shuailei Ma, Ka Leong Cheng, Wen Wang, Qingyan Bai, Yuxuan Zhang, Yanhong Zeng, Yixuan Li, Xing Zhu, Yujun Shen, Qifeng Chen
cs.AI
Abstract
We propose MagicQuill V2, a novel system that introduces a layered composition paradigm to generative image editing, bridging the gap between the semantic power of diffusion models and the granular control of traditional graphics software. While diffusion transformers excel at holistic generation, their use of singular, monolithic prompts fails to disentangle distinct user intentions for content, position, and appearance. To overcome this, our method deconstructs creative intent into a stack of controllable visual cues: a content layer for what to create, a spatial layer for where to place it, a structural layer for how it is shaped, and a color layer for its palette. Our technical contributions include a specialized data generation pipeline for context-aware content integration, a unified control module to process all visual cues, and a fine-tuned spatial branch for precise local editing, including object removal. Extensive experiments validate that this layered approach effectively resolves the user intention gap, granting creators direct, intuitive control over the generative process.
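The abstract's four-layer decomposition of creative intent can be illustrated with a minimal sketch. Note that this is a hypothetical data structure invented for illustration; the field names, types, and `active_layers` helper are assumptions, not the paper's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical sketch of the layered visual-cue stack described in the
# abstract. All names and representations here are illustrative guesses:
# each layer is optional, so a user supplies only the cues they care about.

@dataclass
class VisualCueStack:
    content: Optional[str] = None    # content layer: what to create
    spatial: Optional[list] = None   # spatial layer: where to place it (e.g., a box or mask)
    structure: Optional[str] = None  # structural layer: how it is shaped (e.g., an edge map)
    color: Optional[str] = None      # color layer: its palette (e.g., coarse color strokes)

    def active_layers(self) -> List[str]:
        """Return the names of the cue layers the user has provided."""
        return [name for name, value in vars(self).items() if value is not None]

# A user who only sketches a shape and marks where it goes:
cues = VisualCueStack(spatial=[32, 32, 96, 96], structure="edge_map.png")
print(cues.active_layers())  # ['spatial', 'structure']
```

The point of the stack is that unset layers impose no constraint, so the system can fall back to the model's own generation for anything the user left unspecified, rather than forcing all intent through one monolithic prompt.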