

DreamOmni2: Multimodal Instruction-based Editing and Generation

October 8, 2025
Authors: Bin Xia, Bohao Peng, Yuechen Zhang, Junjia Huang, Jiyang Liu, Jingyao Li, Haoru Tan, Sitong Wu, Chengyao Wang, Yitong Wang, Xinglong Wu, Bei Yu, Jiaya Jia
cs.AI

Abstract

Recent advancements in instruction-based image editing and subject-driven generation have garnered significant attention, yet both tasks still face limitations in meeting practical user needs. Instruction-based editing relies solely on language instructions, which often fail to capture specific editing details, making reference images necessary. Meanwhile, subject-driven generation is limited to combining concrete objects or people, overlooking broader, abstract concepts. To address these challenges, we propose two novel tasks: multimodal instruction-based editing and generation. These tasks support both text and image instructions and extend the scope to include both concrete and abstract concepts, greatly enhancing their practical applications. We introduce DreamOmni2, which tackles two primary challenges: data creation and model framework design. Our data synthesis pipeline consists of three steps: (1) using a feature-mixing method to create extraction data for both abstract and concrete concepts, (2) generating multimodal instruction-based editing training data with the editing and extraction models, and (3) further applying the extraction model to create training data for multimodal instruction-based generation. For the framework, to handle multi-image input, we propose an index encoding and position encoding shift scheme, which helps the model distinguish images and avoid pixel confusion. We also introduce joint training with a vision-language model (VLM) and our generation/editing model to better process complex instructions. Finally, we propose comprehensive benchmarks for these two new tasks to drive their development. Experiments show that DreamOmni2 achieves impressive results. Models and code will be released.
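
To make the multi-image handling more concrete, below is a minimal, hypothetical PyTorch sketch of what an "index encoding and position encoding shift" could look like: each reference image receives its own learned index embedding, and its positional coordinates are shifted by a per-image offset so tokens from different images never share a position. All names, dimensions, and the module structure are illustrative assumptions, not the paper's actual implementation.

# Hypothetical sketch of an index encoding + position encoding shift for
# multi-image input. Constants and module names are assumptions for illustration.
import torch
import torch.nn as nn

D_MODEL = 64      # embedding width (illustrative)
MAX_IMAGES = 4    # maximum number of input images handled
GRID = 8          # assume each image is tokenized into an 8x8 patch grid

class MultiImageEncoding(nn.Module):
    def __init__(self):
        super().__init__()
        # one learned "index" embedding per input-image slot
        self.index_embed = nn.Embedding(MAX_IMAGES, D_MODEL)
        # shared 2D positional embeddings, with room for the per-image row shift
        self.row_embed = nn.Embedding(GRID * MAX_IMAGES, D_MODEL)
        self.col_embed = nn.Embedding(GRID, D_MODEL)

    def forward(self, image_tokens: torch.Tensor) -> torch.Tensor:
        # image_tokens: (batch, num_images, GRID*GRID, D_MODEL) patch embeddings
        b, n, t, d = image_tokens.shape
        rows = torch.arange(GRID).repeat_interleave(GRID)   # (GRID*GRID,)
        cols = torch.arange(GRID).repeat(GRID)               # (GRID*GRID,)
        out = []
        for i in range(n):
            # shift the row coordinate by a per-image offset so positions of
            # different images do not collide ("position encoding shift")
            shifted_rows = rows + i * GRID
            pos = self.row_embed(shifted_rows) + self.col_embed(cols)
            # tag every token of image i with its index embedding
            idx = self.index_embed(torch.tensor(i)).expand(t, d)
            out.append(image_tokens[:, i] + pos + idx)
        return torch.cat(out, dim=1)   # (batch, n*GRID*GRID, D_MODEL)

# usage sketch
enc = MultiImageEncoding()
tokens = torch.randn(2, 3, GRID * GRID, D_MODEL)   # 3 reference images
print(enc(tokens).shape)                             # torch.Size([2, 192, 64])

In this sketch the shift is applied along the row axis only; the same idea would extend to whatever positional scheme the backbone actually uses (for example, offsets applied to rotary position embeddings), which is one way the abstract's claim of avoiding pixel confusion between images could be realized.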