Exploring the Deep Fusion of Large Language Models and Diffusion Transformers for Text-to-Image Synthesis

Abstract

This paper does not describe a new method; instead, it provides a thorough exploration of an important yet understudied design space related to recent advances in text-to-image synthesis, specifically the deep fusion of large language models (LLMs) and diffusion transformers (DiTs) for multi-modal generation. Previous studies mainly focused on overall system performance rather than detailed comparisons with alternative methods, and key design details and training recipes were often left undisclosed. These gaps create uncertainty about the real potential of this approach. To fill these gaps, we conduct an empirical study on text-to-image generation, performing controlled comparisons with established baselines, analyzing important design choices, and providing a clear, reproducible recipe for training at scale. We hope this work offers meaningful data points and practical guidelines for future research in multi-modal generation.
