Uni-Edit: Intelligent Editing Is a General Task for Unified Model Tuning

Abstract

Currently, enhancing Unified Multimodal Models (UMMs) that combine image understanding, generation, and editing capabilities relies mainly on mixed multi-task training. Because of inherent task conflicts, this strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, and it yields only a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, since it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, which transforms diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, a dataset that pairs diverse, reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit achieves comprehensive improvements across all three capabilities without any auxiliary operations.