Unified Multimodal Autoregressive Modeling with a Shared-Context Visual Tokenizer Is the Key to Unification

Abstract

Unified multimodal modeling aims to integrate visual understanding and generation within a single system. However, existing approaches typically rely on two different visual tokenizers, which splits the representation space and hinders truly unified modeling. We propose UniAR, a unified autoregressive framework in which a single discrete visual tokenizer serves as the key bridge between understanding and generation, enabling a shared context in which the model can directly interpret its own generated visual tokens without additional re-encoding. UniAR adapts a pretrained vision encoder with multi-level feature fusion and a lookup-free bitwise quantization scheme, preserving both high-level semantics and low-level details while scaling the effective visual vocabulary at minimal cost. Building on this, the unified autoregressive model uses parallel bitwise prediction to jointly predict spatially grouped, multi-level visual codes, substantially reducing visual sequence length and accelerating generation. Finally, a diffusion-based visual decoder decodes the discrete visual tokens into high-fidelity images. Through large-scale pre-training, followed by supervised fine-tuning and reinforcement learning, UniAR achieves state-of-the-art performance on image generation and image editing while remaining competitive on multimodal understanding benchmarks. The project page is available at https://sharelab-sii.github.io/uniar-web.
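
To make the idea of lookup-free bitwise quantization more concrete, the following is a minimal sketch of one common sign-based variant. It is an illustration only and is not taken from UniAR: the abstract does not specify the quantizer's details, and the function names (quantize_bitwise, bits_to_index, dequantize) and the 4-channel example feature are hypothetical. The sketch shows why the effective vocabulary grows as 2^d with the number of bits d while no codebook is stored.

```python
from typing import List, Sequence


def quantize_bitwise(features: Sequence[float]) -> List[int]:
    """Binarize each feature channel by its sign: 1 if non-negative, else 0."""
    return [1 if x >= 0.0 else 0 for x in features]


def bits_to_index(bits: Sequence[int]) -> int:
    """Pack a bit vector (most significant bit first) into its implicit token index."""
    index = 0
    for bit in bits:
        index = (index << 1) | bit
    return index


def dequantize(bits: Sequence[int]) -> List[float]:
    """Map bits back to a +1/-1 code vector; no codebook lookup is involved."""
    return [1.0 if b else -1.0 for b in bits]


# Hypothetical 4-channel feature for a single visual token (assumption for illustration).
features = [0.7, -1.2, 0.1, -0.3]
bits = quantize_bitwise(features)   # one bit per channel
index = bits_to_index(bits)         # integer id implied by the bit pattern
vocab_size = 2 ** len(bits)         # number of distinct codes reachable with d bits
```

Under this assumed scheme, adding one channel doubles the implicit vocabulary without adding any codebook entries, and a model can emit d binary decisions per token instead of a single choice over 2^d classes, which is the general intuition behind predicting bits in parallel.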