UniDDT: Unifying Multimodal Understanding and Generation via a Decoupled Diffusion Transformer

Abstract

Unified Multimodal Models (UMMs) have emerged as a critical direction for general-purpose multimodal intelligence, integrating understanding and generation into a single framework. However, existing UMMs face three prominent challenges: (1) inherent learning conflicts between visual understanding and visual generation, which lead to suboptimal modeling of both tasks; (2) mismatched visual spaces for understanding and generation, which impede scalability; and (3) over-reliance on task-specific data that neglects the duality between text-image understanding and generation. To address these challenges, we propose UniDDT, which uses a Noisy ViT encoder together with an LLM to unify semantic encoding for visual generation and understanding, while employing a separate diffusion decoder to decouple diffusion decoding from text decoding. With the Noisy ViT encoder, UniDDT uses the latent space as a unified visual representation, making understanding and generation tasks directly compatible. This balances scalability in generation against semantic expressiveness in understanding. In addition, we construct dual data structures from the same image-text pairs, fostering interdependence between generation data and understanding data to exploit their inherent duality. Extensive experiments demonstrate that UniDDT effectively unifies multimodal understanding and generation with improved semantic consistency and scalability. On visual generation, UniDDT achieves a GenEval score of 0.87 and a DPG overall score of 86.9. On multimodal understanding, UniDDT scores 1699.5 on the MME benchmark and achieves an overall score of 76.5 on SEEDbench.
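
The idea of building dual data structures from the same image-text pairs can be illustrated with a minimal sketch. The abstract does not specify the actual data format, so everything below is an assumption made for illustration: the `Sample` record, the `build_dual_samples` and `build_dual_dataset` helpers, and the `<image:...>` placeholder string are hypothetical names, not part of UniDDT. The sketch only shows that one pair can yield an understanding sample (image to text) and a generation sample (text to image) whose condition and target are mirrored.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass(frozen=True)
class Sample:
    """Hypothetical training record; not the format used by UniDDT."""
    task: str       # "understanding" or "generation"
    condition: str  # what the model is given
    target: str     # what the model is asked to produce


def build_dual_samples(image_id: str, caption: str) -> Tuple[Sample, Sample]:
    """Turn one image-text pair into two mirrored samples."""
    image_ref = f"<image:{image_id}>"  # assumed placeholder for image content
    understanding = Sample("understanding", condition=image_ref, target=caption)
    generation = Sample("generation", condition=caption, target=image_ref)
    return understanding, generation


def build_dual_dataset(pairs: Iterable[Tuple[str, str]]) -> List[Sample]:
    """Expand every (image_id, caption) pair into both task views."""
    samples: List[Sample] = []
    for image_id, caption in pairs:
        samples.extend(build_dual_samples(image_id, caption))
    return samples


if __name__ == "__main__":
    for sample in build_dual_dataset([("img_001", "a red bus on a wet street")]):
        print(sample)
```

In this sketch the two samples share the same underlying pair, so any improvement in caption quality or image coverage reaches both tasks at once, which is one plausible reading of the interdependence the abstract describes.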