Uni-Edit: Intelligent Editing Is a General Task for Unified Model Tuning

Abstract

Equipping Unified Multimodal Models (UMMs) with image understanding, generation, and editing capabilities currently relies mainly on mixed multi-task training. Because these tasks inherently conflict, such a strategy requires complex multi-stage pipelines, massive data mixing, and balancing tricks, and it yields only a performance trade-off rather than true mutual reinforcement. To break this paradigm, we propose Uni-Edit, an intelligent image editing task that serves as the first general task for UMM tuning. Unlike complex mixed pipelines, Uni-Edit improves all three abilities at once using only one task, one training stage, and one dataset. Specifically, we first identify image editing as an inherently ideal general task, since it naturally demands both visual understanding and generation. However, existing editing data relies on simplistic instructions that severely underutilize a model's understanding capacity. To address this, we introduce the first automated and scalable data synthesis pipeline for intelligent editing, which transforms diverse VQA data into complex and effective editing instructions with embedded questions and nested logic. This yields Uni-Edit-148k, a dataset that pairs diverse, reasoning-intensive instructions with high-quality edited images. Extensive experiments on BAGEL and Janus-Pro demonstrate that tuning solely on Uni-Edit improves all three capabilities without any auxiliary operations.