

DreamOmni2: Multimodal Instruction-based Editing and Generation

October 8, 2025
作者: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
cs.AI

Abstract

Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, tackling two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature mixing method to create extraction data for both abstract and concrete concepts, (2) generating training data for multimodal instruction-based editing using the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. Additionally, we introduce joint training with the VLM and our generation/editing model to better process complex instructions. We also propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.
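The abstract only names the index encoding and position encoding shift scheme without detailing it. The sketch below is a minimal, hypothetical illustration of the general idea, not the authors' implementation: each reference image's tokens receive a learned per-image index embedding, and its 2D position coordinates are shifted so tokens from different images never share the same positions. All names (MultiImagePacker, index_embed, etc.) are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code) of an "index encoding +
# position encoding shift" idea for multi-image conditioning, assuming
# a DiT-style model whose image tokens carry 2D (row, col) positions.
import torch
import torch.nn as nn

class MultiImagePacker(nn.Module):
    def __init__(self, hidden_dim: int, max_images: int = 8):
        super().__init__()
        # One learned "index" embedding per input image slot, added to
        # every token of that image so the model can tell images apart.
        self.index_embed = nn.Embedding(max_images, hidden_dim)

    def forward(self, image_tokens, image_grids):
        """image_tokens[i]: (h_i * w_i, hidden_dim) tokens of image i.
        image_grids[i]: (h_i, w_i) token-grid size of image i."""
        packed_tokens, packed_pos = [], []
        col_offset = 0
        for i, (tokens, (h, w)) in enumerate(zip(image_tokens, image_grids)):
            # Index encoding: tag all tokens of image i with embedding i.
            tokens = tokens + self.index_embed.weight[i]
            # Position encoding shift: offset this image's column
            # coordinates so no two images share the same 2D positions.
            rows = torch.arange(h).repeat_interleave(w)
            cols = torch.arange(w).repeat(h) + col_offset
            packed_tokens.append(tokens)
            packed_pos.append(torch.stack([rows, cols], dim=-1))
            col_offset += w  # next image starts in a fresh coordinate range
        return torch.cat(packed_tokens, dim=0), torch.cat(packed_pos, dim=0)
```

Under this assumption, the shifted positions keep reference images spatially disjoint (avoiding "pixel confusion"), while the index embedding lets text instructions refer to a specific input image (e.g. "the object in image 2").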