DreamOmni2: Multimodal Instruction-based Editing and Generation
October 8, 2025
Authors: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
cs.AI
Abstract
Recent advancements in instruction-based image editing and subject-driven
generation have garnered significant attention, yet both tasks still face
limitations in meeting practical user needs. Instruction-based editing relies
solely on language instructions, which often fail to capture specific editing
details, making reference images necessary. Meanwhile, subject-driven
generation is limited to combining concrete objects or people, overlooking
broader, abstract concepts. To address these challenges, we propose two novel
tasks: multimodal instruction-based editing and generation. These tasks support
both text and image instructions and extend the scope to include both concrete
and abstract concepts, greatly enhancing their practical applications. We
introduce DreamOmni2, tackling two primary challenges: data creation and model
framework design. Our data synthesis pipeline consists of three steps: (1)
using a feature mixing method to create extraction data for both abstract and
concrete concepts, (2) generating multimodal instruction-based editing training
data using the editing and extraction models, and (3) further applying the
extraction model to create training data for multimodal instruction-based
generation. For the framework, to handle multi-image input, we propose an index
encoding and position encoding shift scheme, which helps the model distinguish
images and avoid pixel confusion. Additionally, we introduce joint training
of a vision-language model (VLM) and our generation/editing model to better
process complex instructions. We also propose comprehensive benchmarks for
these two new tasks to drive their development. Experiments show that
DreamOmni2 achieves impressive results. Models and code will be released.
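
The "index encoding and position encoding shift" scheme for multi-image input described in the abstract can be pictured with the minimal sketch below. It is an illustrative assumption rather than the authors' implementation: the class name IndexedImagePacker, the shift_stride parameter, the additive learned index embedding, and the square-grid simplification are all ours; DreamOmni2's actual formulation may differ.

```python
# Hypothetical sketch: tag each input image's tokens with an index embedding
# and shift their 2D positions so images never share coordinates.
import torch
import torch.nn as nn


class IndexedImagePacker(nn.Module):
    """Packs tokens from several input images into one sequence.

    (1) Index encoding: a learned embedding marks which image a token
        belongs to. (2) Position shift: each image's coordinate grid is
        offset so tokens from different images occupy disjoint positions.
    """

    def __init__(self, hidden_dim: int, max_images: int = 4, shift_stride: int = 64):
        super().__init__()
        self.index_embed = nn.Embedding(max_images, hidden_dim)
        self.shift_stride = shift_stride  # horizontal offset between images (assumed)

    def forward(self, image_tokens: list[torch.Tensor]):
        # image_tokens[i]: (B, N_i, D) patch tokens of the i-th input image,
        # assumed here to come from a square N_i = side_i * side_i grid.
        packed, positions = [], []
        for idx, tokens in enumerate(image_tokens):
            b, n, d = tokens.shape
            device = tokens.device
            # Index encoding: same learned vector added to every token of image idx.
            img_ids = torch.full((n,), idx, dtype=torch.long, device=device)
            tokens = tokens + self.index_embed(img_ids)            # (B, N_i, D)
            # Position shift: move this image's grid along one axis by idx * stride.
            side = int(n ** 0.5)
            ys, xs = torch.meshgrid(
                torch.arange(side, device=device),
                torch.arange(side, device=device),
                indexing="ij",
            )
            coords = torch.stack([ys, xs + idx * self.shift_stride], dim=-1)
            positions.append(coords.reshape(n, 2))                 # (N_i, 2)
            packed.append(tokens)
        # One joint token sequence with disjoint positions for downstream attention.
        return torch.cat(packed, dim=1), torch.cat(positions, dim=0)
```

Under this reading, the shift keeps the coordinates of different input images disjoint, so position-aware attention cannot trivially map a reference image's pixels onto the same locations in the output, which is one plausible way to avoid the "pixel confusion" the abstract mentions.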