TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models

Abstract

Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
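The cascade described above (a VAE encoder feeding a representation encoder, whose output serves as the single visual representation for both understanding and generation) can be illustrated with a minimal sketch. This is not TUNA's implementation: the function names, patch size, and dimensions are hypothetical, and random linear projections stand in for the pretrained encoders. It shows only the data flow, in which the representation encoder operates on VAE latents rather than on raw pixels.

```python
import numpy as np

rng = np.random.default_rng(0)

def vae_encode(image, w_vae, patch=8):
    """Toy stand-in for a VAE encoder (hypothetical): split the image into
    non-overlapping patches and project each one to a continuous latent."""
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    patches = image[:gh * patch, :gw * patch].reshape(gh, patch, gw, patch, c)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(gh * gw, patch * patch * c)
    return patches @ w_vae  # shape: (num_tokens, latent_dim)

def representation_encode(latents, w_rep):
    """Toy stand-in for a representation encoder (hypothetical): it consumes
    VAE latents, not pixels, which is what makes the design a cascade."""
    return np.tanh(latents @ w_rep)  # shape: (num_tokens, feature_dim)

def unified_visual_tokens(image, w_vae, w_rep):
    """Cascade the two encoders to obtain one continuous representation that
    both understanding and generation heads would share."""
    return representation_encode(vae_encode(image, w_vae), w_rep)

# Assumed sizes, for illustration only.
PATCH, LATENT_DIM, FEATURE_DIM = 8, 16, 64
image = rng.random((32, 32, 3))
w_vae = rng.normal(size=(PATCH * PATCH * 3, LATENT_DIM)) * 0.1
w_rep = rng.normal(size=(LATENT_DIM, FEATURE_DIM)) * 0.1

tokens = unified_visual_tokens(image, w_vae, w_rep)
print(tokens.shape)
```

In a decoupled design, by contrast, understanding and generation would each read from a separate encoder's output; here both would read from the same `tokens` array, which is the property the abstract credits with avoiding representation format mismatches.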