MIGE: A Unified Framework for Multimodal Instruction-Based Image Generation and Editing

Abstract

Despite significant progress in diffusion-based image generation, subject-driven generation and instruction-based editing remain challenging. Existing methods typically treat the two tasks separately and struggle with limited high-quality data and poor generalization. However, both tasks require capturing complex visual variations while maintaining consistency between inputs and outputs. We therefore propose MIGE, a unified framework that standardizes task representations using multimodal instructions. It treats subject-driven generation as creation on a blank canvas and instruction-based editing as modification of an existing image, establishing a shared input-output formulation. MIGE introduces a novel multimodal encoder that maps free-form multimodal instructions into a unified vision-language space, integrating visual and semantic features through a feature fusion mechanism. This unification enables joint training of both tasks, providing two key advantages: (1) Cross-Task Enhancement: By leveraging shared visual and semantic representations, joint training improves instruction adherence and visual consistency in both subject-driven generation and instruction-based editing. (2) Generalization: Learning in a unified format facilitates cross-task knowledge transfer, enabling MIGE to generalize to novel compositional tasks, including instruction-based subject-driven editing. Experiments show that MIGE excels in both subject-driven generation and instruction-based editing while setting a new state of the art in the new task of instruction-based subject-driven editing. The code and model are publicly available at https://github.com/Eureka-Maggie/MIGE.
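The shared input-output formulation described in the abstract can be illustrated with a small sketch. This is not MIGE's actual code: the names UnifiedSample, ImageRef, blank_canvas, make_generation_sample, and make_editing_sample are hypothetical, images are toy nested lists, and an all-zero image stands in for the "blank canvas". It only shows how both tasks might be packed into one record type so that a single model could, in principle, be trained on both.

```python
from dataclasses import dataclass
from typing import List, Union

# Toy grayscale image as nested lists of floats (an assumption for illustration).
Image = List[List[float]]


@dataclass
class ImageRef:
    """Hypothetical wrapper for a reference image embedded in an instruction."""
    pixels: Image


# A free-form multimodal instruction: text fragments interleaved with images.
Instruction = List[Union[str, ImageRef]]


@dataclass
class UnifiedSample:
    """One record shape shared by both tasks (hypothetical)."""
    instruction: Instruction  # what to do, possibly with reference images
    canvas: Image             # image the model starts from
    target: Image             # image the model should produce


def blank_canvas(height: int, width: int) -> Image:
    """All-zero image standing in for the 'blank canvas' (assumption)."""
    return [[0.0] * width for _ in range(height)]


def make_generation_sample(instruction: Instruction, target: Image) -> UnifiedSample:
    """Subject-driven generation: start from a blank canvas of the target size."""
    height, width = len(target), len(target[0])
    return UnifiedSample(instruction, blank_canvas(height, width), target)


def make_editing_sample(instruction: Instruction, source: Image, target: Image) -> UnifiedSample:
    """Instruction-based editing: start from the existing source image."""
    return UnifiedSample(instruction, source, target)
```

Under these assumptions, the two tasks differ only in what fills the canvas field, which is what lets them share one training format.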
