Unified Multimodal Autoregressive Modeling with Shared Context: The Visual Tokenizer Is the Key to Unification

Abstract

Unified multimodal modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two separate visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation. This creates a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this tokenizer, the unified autoregressive model adopts parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder operates on the discrete visual tokens to decode high-fidelity images. Through large-scale pretraining followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
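To make the "minimal cost" claim for lookup-free bitwise quantization more concrete, the sketch below shows one common way such a scheme can work: each feature channel is thresholded to a single bit, so d channels give an implicit vocabulary of 2^d codes without storing or searching a codebook. This is an illustrative sketch only, not UniAR's actual implementation. The sign-threshold rule, the function names, and the square grouping used in the sequence-length helper are all assumptions introduced here.

```python
from typing import List


def bitwise_quantize(features: List[float]) -> List[int]:
    """Quantize each channel to one bit by its sign (assumed rule: 1 if x >= 0, else 0)."""
    return [1 if x >= 0.0 else 0 for x in features]


def bits_to_index(bits: List[int]) -> int:
    """Pack the bits into an implicit code index, most significant bit first."""
    index = 0
    for b in bits:
        index = (index << 1) | b
    return index


def dequantize(bits: List[int]) -> List[float]:
    """Map bits back to a +/-1 vector; no codebook lookup is needed."""
    return [1.0 if b else -1.0 for b in bits]


def grouped_sequence_length(height: int, width: int, group: int) -> int:
    """Autoregressive steps needed if each group x group block of token
    positions is predicted jointly in one step (hypothetical grouping)."""
    rows = -(-height // group)  # ceiling division
    cols = -(-width // group)
    return rows * cols


features = [0.3, -1.2, 0.0, 2.5]          # a toy 4-channel feature vector
bits = bitwise_quantize(features)         # one bit per channel
index = bits_to_index(bits)               # implicit token id
vocab_size = 2 ** len(bits)               # grows exponentially with channel count
steps_ungrouped = grouped_sequence_length(32, 32, 1)
steps_grouped = grouped_sequence_length(32, 32, 2)
print(bits, index, vocab_size, steps_ungrouped, steps_grouped)
```

Under these assumptions, adding one channel doubles the implicit vocabulary while adding only one more bit to predict, and grouping a 32 x 32 token grid into 2 x 2 blocks cuts the number of autoregressive steps by a factor of four.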