Unified Multimodal Discrete Diffusion

Abstract

Multimodal generative models that can understand and generate across multiple modalities are currently dominated by autoregressive (AR) approaches, which process tokens sequentially from left to right or from top to bottom. These models jointly handle images, text, video, and audio for tasks such as image captioning, question answering, and image generation. In this work, we explore discrete diffusion models as a unified generative formulation in the joint text and image domain, building on their recent success in text generation. Discrete diffusion models offer several advantages over AR models: improved control over the trade-off between quality and diversity of generated samples, the ability to perform joint multimodal inpainting across both the text and image domains, and greater controllability in generation through guidance. Leveraging these benefits, we present the first Unified Multimodal Discrete Diffusion (UniDisc) model, which is capable of jointly understanding and generating text and images for a variety of downstream tasks. We compare UniDisc to multimodal AR models in a scaling analysis and demonstrate that UniDisc outperforms them in performance and inference-time compute, and offers enhanced controllability, editability, inpainting, and a flexible trade-off between inference time and generation quality. Code and additional visualizations are available at https://unidisc.github.io.