Instruct-Imagen: Image Generation with Multi-modal Instruction

Abstract

This paper presents instruct-imagen, a model that tackles heterogeneous image generation tasks and generalizes to unseen tasks. We introduce *multi-modal instruction* for image generation, a task representation that articulates a range of generation intents with precision. It uses natural language to combine disparate modalities (e.g., text, edge, style, subject), so that a wide variety of generation intents can be standardized in a uniform format. We then build instruct-imagen by fine-tuning a pre-trained text-to-image diffusion model with a two-stage framework. First, we adapt the model using retrieval-augmented training to enhance its ability to ground its generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that instruct-imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks.
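To make the idea of a multi-modal instruction more concrete, the following is a minimal sketch of one way such an instruction could be represented: a natural-language sentence whose placeholders point at conditioning images of different modalities. The abstract does not specify a concrete format, so the `[ref#N]` placeholder syntax, the class names `Context` and `MultiModalInstruction`, and the file paths are all assumptions made for illustration only.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical placeholder syntax such as "[ref#1]"; an assumption of this
# sketch, not a format confirmed by the abstract.
PLACEHOLDER = re.compile(r"\[ref#(\d+)\]")


@dataclass
class Context:
    """One piece of multi-modal context referenced by the instruction."""
    modality: str    # e.g. "subject", "style", "edge"
    image_path: str  # illustrative path to the conditioning image


@dataclass
class MultiModalInstruction:
    """Natural-language text plus the contexts its placeholders refer to."""
    text: str
    contexts: Dict[int, Context] = field(default_factory=dict)

    def referenced_ids(self) -> List[int]:
        # Collect placeholder ids in the order they appear in the text.
        return [int(m) for m in PLACEHOLDER.findall(self.text)]

    def validate(self) -> None:
        # Every placeholder needs a context, and every context must be used.
        referenced = set(self.referenced_ids())
        missing = sorted(referenced - set(self.contexts))
        unused = sorted(set(self.contexts) - referenced)
        if missing or unused:
            raise ValueError(f"missing contexts: {missing}, unused contexts: {unused}")


instruction = MultiModalInstruction(
    text="Generate an image of [ref#1] in the style of [ref#2].",
    contexts={
        1: Context(modality="subject", image_path="dog.png"),
        2: Context(modality="style", image_path="sketch.png"),
    },
)
instruction.validate()  # passes: both placeholders have a matching context
```

The point of the sketch is only that different generation intents (subject-driven, style-driven, edge-conditioned, or combinations of them) can share a single uniform structure; how the model actually encodes the text and the context images is outside its scope.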
