DreamOmni3: Scribble-based Editing and Generation
December 27, 2025
Authors: Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li, Junjia Huang, Xu Zhao, Yitong Wang, Ruihang Chu, Bei Yu, Jiaya Jia
cs.AI
Abstract
Recently, unified generation and editing models have achieved remarkable success thanks to their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to convey the edit regions and fine-grained visual details a user intends. To this end, we propose two new tasks, scribble-based editing and scribble-based generation, which enable more flexible creation in a graphical user interface (GUI) by combining user text, images, and freehand sketches. We introduce the DreamOmni3 framework, which tackles two key challenges: data creation and framework design.
Our data synthesis pipeline covers both scribble-based editing and scribble-based generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Building on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles, or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, and follow a similar data creation pipeline.
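The paper does not include code, but the minimal sketch below illustrates how such a scribble overlay could be synthesized, assuming the editable region is already available as a bounding box; the function name, jitter amount, and drawing style are illustrative assumptions, not the authors' pipeline.

```python
# Illustrative sketch (not the authors' code): overlay a jittered, hand-drawn-looking
# box on an editable region of a source image to create a scribbled training input.
import random
from PIL import Image, ImageDraw

def overlay_scribble_box(image_path, region, color=(255, 0, 0), jitter=6, width=5):
    """region: (x0, y0, x1, y1) of an extracted editable region (assumed given)."""
    img = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    x0, y0, x1, y1 = region
    # Jitter the corners so the box looks freehand rather than axis-aligned.
    pts = [(x0, y0), (x1, y0), (x1, y1), (x0, y1), (x0, y0)]
    pts = [(x + random.randint(-jitter, jitter), y + random.randint(-jitter, jitter))
           for x, y in pts]
    draw.line(pts, fill=color, width=width, joint="curve")
    return img  # the scribbled source; the untouched original is kept alongside it

# Example: scribbled = overlay_scribble_box("source.jpg", (120, 80, 360, 300))
```

Cropped reference images (e.g., for the image-fusion task) could be pasted into the region in the same way, again keeping the clean original next to the scribbled copy.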
For the framework, conventional binary masks struggle with complex edits that involve multiple scribbles, images, and instructions. We instead propose a joint input scheme that feeds both the original image and the scribbled source image into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize the scribbled regions while keeping edits accurate. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results show that DreamOmni3 achieves outstanding performance; models and code will be publicly released.
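As a rough illustration of the joint input scheme, the sketch below builds a combined token sequence in which the original and scribbled images reuse identical index and 2D position IDs; the tensor shapes and position layout are assumptions for illustration, not the released implementation.

```python
# Illustrative sketch (assumed shapes, not the released implementation):
# give the original and the scribbled source image identical index/position IDs
# so that spatially corresponding tokens share their positional encoding.
import torch

def build_joint_image_inputs(orig_tokens, scribble_tokens, grid_h, grid_w, image_index=0):
    """orig_tokens, scribble_tokens: (N, D) patch tokens of the same image grid."""
    assert orig_tokens.shape == scribble_tokens.shape
    ys, xs = torch.meshgrid(torch.arange(grid_h), torch.arange(grid_w), indexing="ij")
    pos_ids = torch.stack([ys.flatten(), xs.flatten()], dim=-1)  # (N, 2) 2D positions
    idx_ids = torch.full((grid_h * grid_w,), image_index)        # same image index for both
    # Concatenate the two views but *reuse* the same position/index ids for each.
    tokens = torch.cat([orig_tokens, scribble_tokens], dim=0)    # (2N, D)
    pos_ids = torch.cat([pos_ids, pos_ids], dim=0)               # identical positions
    idx_ids = torch.cat([idx_ids, idx_ids], dim=0)               # identical indices
    return tokens, pos_ids, idx_ids
```

Sharing positions in this way lets attention align each scribbled patch with its clean counterpart directly, rather than relying on a separate binary mask channel.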