Uni-Edit: Intelligent Editing Is A General Task For Unified Model Tuning
May 20, 2026
Authors: Dian Zheng, Manyuan Zhang, Hongyu Li, Hongbo Liu, Kai Zou, Kaituo Feng, Hongsheng Li
cs.AI
Abstract
Currently, enhancing the image understanding, generation, and editing capabilities of Unified Multimodal Models (UMMs) mainly relies on mixed multi-task training. Due to inherent task conflicts, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, yet it yields only a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, as it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, which transforms diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, a dataset that pairs diverse, reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit comprehensively enhances all three capabilities without any auxiliary operations.