SUR-adapter: Enhancing Text-to-Image Pre-trained Diffusion Models with Large Language Models

Abstract

Diffusion models, which have become popular text-to-image generation models, can produce high-quality, content-rich images guided by textual prompts. However, existing models show limited semantic understanding and commonsense reasoning when the input prompts are concise narratives, which results in low-quality image generation. To improve their capacity for narrative prompts, we propose the Semantic Understanding and Reasoning adapter (SUR-adapter), a simple yet effective parameter-efficient fine-tuning approach for pre-trained diffusion models. To this end, we first collect and annotate a new dataset, SURD, which consists of more than 57,000 semantically corrected multi-modal samples. Each sample contains a simple narrative prompt, a complex keyword-based prompt, and a high-quality image. We then align the semantic representation of narrative prompts with that of the complex prompts and transfer the knowledge of large language models (LLMs) to our SUR-adapter via knowledge distillation, so that it acquires the semantic understanding and reasoning capabilities needed to build a high-quality textual semantic representation for text-to-image generation. We conduct experiments that integrate multiple LLMs and popular pre-trained diffusion models to show that our approach enables diffusion models to understand and reason about concise natural language without degrading image quality. Our approach makes text-to-image diffusion models easier to use and improves the user experience, and it has the potential to further advance user-friendly text-to-image generation models by bridging the semantic gap between simple narrative prompts and complex keyword-based prompts.
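
The two training signals described above (aligning narrative-prompt representations with complex-prompt representations, and distilling LLM knowledge into the adapter) can be pictured as a combined objective. The following is a minimal sketch under stated assumptions, not the SUR-adapter implementation: the abstract does not specify the loss functions or the adapter architecture, so the residual linear adapter, the mean squared error terms, the `distill_weight` parameter, and the premise that the LLM feature shares the embedding dimension are all hypothetical choices made for illustration.

```python
# Illustrative sketch only. Every name, shape, and loss choice here is an
# assumption made for exposition; none of it is taken from the SUR-adapter code.
from typing import List

Vector = List[float]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def adapt(narrative_emb: Vector, weights: List[Vector]) -> Vector:
    """Hypothetical adapter: a residual linear map on the narrative embedding."""
    projected = [sum(w * x for w, x in zip(row, narrative_emb)) for row in weights]
    return [x + p for x, p in zip(narrative_emb, projected)]


def combined_loss(
    narrative_emb: Vector,
    complex_emb: Vector,
    llm_emb: Vector,
    weights: List[Vector],
    distill_weight: float = 0.5,
) -> float:
    """Alignment term plus a weighted distillation term (both assumed to be MSE)."""
    adapted = adapt(narrative_emb, weights)
    align = mse(adapted, complex_emb)   # pull toward the complex-prompt embedding
    distill = mse(adapted, llm_emb)     # pull toward the LLM feature (assumed same size)
    return align + distill_weight * distill


if __name__ == "__main__":
    zero_weights = [[0.0, 0.0], [0.0, 0.0]]  # adapter starts as the identity map
    print(combined_loss([1.0, 2.0], [1.0, 2.0], [1.0, 4.0], zero_weights))
```

In a parameter-efficient setup of this kind, only `weights` would be updated during training while the pre-trained diffusion model and the LLM stay frozen; a real implementation would use a tensor library with automatic differentiation rather than plain lists.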
