TUNA: Taming Unified Visual Representations for Native Unified Multimodal Models
December 1, 2025
Authors: Zhiheng Liu, Weiming Ren, Haozhe Liu, Zijian Zhou, Shoufa Chen, Haonan Qiu, Xiaoke Huang, Zhaochong An, Fanny Yang, Aditya Patel, Viktar Atliha, Tony Ng, Xiao Han, Chuyan Zhu, Chenyang Zhang, Ding Liu, Juan-Manuel Perez-Rua, Sen He, Jürgen Schmidhuber, Wenhu Chen, Ping Luo, Wei Liu, Tao Xiang, Jonas Schult, Yuren Cong
cs.AI
Abstract
Unified multimodal models (UMMs) aim to jointly perform multimodal understanding and generation within a single framework. We present TUNA, a native UMM that builds a unified continuous visual representation by cascading a VAE encoder with a representation encoder. This unified representation space allows end-to-end processing of images and videos for both understanding and generation tasks. Compared to prior UMMs with decoupled representations, TUNA's unified visual space avoids representation format mismatches introduced by separate encoders, outperforming decoupled alternatives in both understanding and generation. Moreover, we observe that stronger pretrained representation encoders consistently yield better performance across all multimodal tasks, highlighting the importance of the representation encoder. Finally, in this unified setting, jointly training on both understanding and generation data allows the two tasks to benefit from each other rather than interfere. Our extensive experiments on multimodal understanding and generation benchmarks show that TUNA achieves state-of-the-art results in image and video understanding, image and video generation, and image editing, demonstrating the effectiveness and scalability of its unified representation design.
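The core architectural idea above, cascading a VAE encoder with a representation encoder so that understanding and generation share one continuous visual token space, can be sketched in miniature. The sketch below is a hypothetical illustration with random projections standing in for learned weights; function names, dimensions, and the single-channel image are all assumptions, not TUNA's actual implementation.

```python
import random

random.seed(0)

def vae_encode(image, patch, latent_dim):
    """Hypothetical stand-in for a VAE encoder: average-pool a single-channel
    HxW image into patches, then project each pooled value to a continuous
    latent vector of size latent_dim (random projection replaces learned weights)."""
    h, w = len(image), len(image[0])
    proj = [random.gauss(0, 1) for _ in range(latent_dim)]
    latents = []
    for i in range(0, h, patch):
        row = []
        for j in range(0, w, patch):
            pooled = sum(image[i + di][j + dj]
                         for di in range(patch) for dj in range(patch)) / patch ** 2
            row.append([pooled * proj[k] for k in range(latent_dim)])
        latents.append(row)
    return latents  # (h/patch) x (w/patch) grid of continuous latents

def repr_encode(latents, hidden_dim):
    """Hypothetical representation encoder: map each VAE latent into the
    unified continuous token space consumed by both the understanding
    and generation branches of the model."""
    d = len(latents[0][0])
    proj = [[random.gauss(0, 1) for _ in range(hidden_dim)] for _ in range(d)]
    tokens = []
    for row in latents:
        for v in row:
            tokens.append([sum(v[i] * proj[i][k] for i in range(d))
                           for k in range(hidden_dim)])
    return tokens  # one token per spatial latent, all in one shared space

# Toy 16x16 single-channel image -> 4x4 latent grid -> 16 unified visual tokens.
image = [[random.random() for _ in range(16)] for _ in range(16)]
latents = vae_encode(image, patch=4, latent_dim=8)
tokens = repr_encode(latents, hidden_dim=32)
print(len(latents), len(latents[0]), len(tokens), len(tokens[0]))  # 4 4 16 32
```

Because every image (or video frame) passes through the same cascade, the downstream model never has to reconcile two representation formats, which is the mismatch the abstract attributes to decoupled-encoder designs.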