

DreamOmni3: Scribble-based Editing and Generation

December 27, 2025
Authors: Bin Xia, Bohao Peng, Jiyang Liu, Sitong Wu, Jingyao Li, Junjia Huang, Xu Zhao, Yitong Wang, Ruihang Chu, Bei Yu, Jiaya Jia
cs.AI

Abstract

Recently, unified generation and editing models have achieved remarkable success with their impressive performance. These models rely mainly on text prompts for instruction-based editing and generation, but language often fails to capture users' intended edit locations and fine-grained visual details. To this end, we propose two tasks, scribble-based editing and generation, which enable more flexible creation in a graphical user interface (GUI) by combining user text, images, and freehand sketches. We introduce DreamOmni3 to tackle two challenges: data creation and framework design. Our data synthesis pipeline comprises two parts: scribble-based editing and scribble-based generation. For scribble-based editing, we define four tasks: scribble and instruction-based editing, scribble and multimodal instruction-based editing, image fusion, and doodle editing. Based on the DreamOmni2 dataset, we extract editable regions and overlay hand-drawn boxes, circles, doodles, or cropped images to construct training data. For scribble-based generation, we define three tasks: scribble and instruction-based generation, scribble and multimodal instruction-based generation, and doodle generation, following similar data creation pipelines. For the framework, instead of using binary masks, which struggle with complex edits involving multiple scribbles, images, and instructions, we propose a joint input scheme that feeds both the original and scribbled source images into the model, using different colors to distinguish regions and simplify processing. By applying the same index and position encodings to both images, the model can precisely localize scribbled regions while maintaining accurate editing. Finally, we establish comprehensive benchmarks for these tasks to promote further research. Experimental results demonstrate that DreamOmni3 achieves outstanding performance; models and code will be publicly released.
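
The joint input scheme can be pictured with a minimal PyTorch sketch, assuming a DiT-style backbone that consumes patch tokens; the function name build_joint_inputs, the tensor shapes, and the image_index argument are illustrative assumptions, not the released implementation:

    import torch

    def build_joint_inputs(orig_tokens, scribbled_tokens, pos_ids, image_index=1):
        # orig_tokens:      (N, D) patch tokens of the clean source image
        # scribbled_tokens: (N, D) patch tokens of the same image with user scribbles
        # pos_ids:          (N, 2) 2D patch coordinates used for the position encoding
        # image_index:      condition-image index shared by both views (assumed scheme)
        tokens = torch.cat([orig_tokens, scribbled_tokens], dim=0)   # (2N, D)
        # Reuse the SAME position ids for both views, so the model sees two
        # spatially aligned copies of one canvas and can localize scribbles exactly.
        positions = torch.cat([pos_ids, pos_ids], dim=0)             # (2N, 2)
        # Both views also share one image index, as described in the abstract.
        indices = torch.full((tokens.size(0),), image_index, dtype=torch.long)
        return tokens, positions, indices

Because the clean and scribbled views occupy identical positions in the token sequence, the scribble colors themselves act as region markers, which is what lets the model dispense with a separate binary mask.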