Unified Multimodal Autoregressive Modeling with Shared Context-Visual Tokenizer is Key to Unification
June 16, 2026
Authors: Wujian Peng, Lingchen Meng, Yuxuan Cai, Xianwei Zhuang, Yuhuan Yang, Rongyao Fang, Chenfei Wu, Junyang Lin, Zuxuan Wu, Shuai Bai
cs.AI
Abstract
Unified multimodal modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two separate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this tokenizer, the unified autoregressive model adopts parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on the discrete visual tokens to decode high-fidelity images. Through large-scale pretraining followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
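To make two ideas from the abstract concrete, the following is a minimal sketch, not UniAR's implementation. It assumes the common sign-based form of lookup-free quantization, where each latent channel becomes one bit and d bits imply 2^d codes with no stored codebook, and it assumes a 2x2 spatial group to show how grouping shortens the sequence. The channel count, grid size, group size, and all function names are hypothetical and chosen for illustration. Multi-level feature fusion and the diffusion decoder are omitted.

```python
from typing import List

BITS_PER_POSITION = 4  # assumed toy channel count; the abstract gives no value
GROUP = 2              # assumed 2x2 spatial group; also illustrative


def bitwise_quantize(latent: List[float]) -> List[int]:
    """Quantize each channel to one bit by its sign, with no codebook lookup."""
    return [1 if x > 0 else 0 for x in latent]


def bits_to_index(bits: List[int]) -> int:
    """Pack bits (most significant first) into the id of the implied code."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def group_bits(bit_grid: List[List[List[int]]], group: int = GROUP) -> List[List[int]]:
    """Concatenate the bits of each group x group block into one grouped token."""
    height, width = len(bit_grid), len(bit_grid[0])
    grouped = []
    for r in range(0, height, group):
        for c in range(0, width, group):
            block: List[int] = []
            for dr in range(group):
                for dc in range(group):
                    block.extend(bit_grid[r + dr][c + dc])
            grouped.append(block)
    return grouped


# Toy 4x4 latent grid: position p gets the +/-1 pattern of p's binary digits.
SIDE = 4
latents = [
    [
        [1.0 if ((r * SIDE + c) >> k) & 1 else -1.0
         for k in reversed(range(BITS_PER_POSITION))]
        for c in range(SIDE)
    ]
    for r in range(SIDE)
]

bit_grid = [[bitwise_quantize(v) for v in row] for row in latents]
grouped_tokens = group_bits(bit_grid)

implied_vocab = 2 ** BITS_PER_POSITION      # 16 codes, no stored codebook
positions = SIDE * SIDE                     # 16 per-position tokens
steps = len(grouped_tokens)                 # 4 grouped prediction steps
bits_per_step = len(grouped_tokens[0])      # 16 bits predicted together

print(implied_vocab, positions, steps, bits_per_step)
```

In this toy setup the 16 grid positions collapse into 4 grouped steps of 16 bits each. A model that predicts every bit of a group in parallel would therefore need a quarter as many autoregressive steps, while the implied vocabulary doubles with each added bit and no embedding table has to be enumerated.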