Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
June 23, 2025
Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
cs.AI
Abstract
This paper presents a multimodal framework that unifies visual understanding
and generation within a shared discrete semantic representation.
At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into
discrete tokens using a text-aligned codebook projected from a large language
model's (LLM) vocabulary. By integrating vision and text into a unified space
with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input
and output through a shared interface, without the need for modality-specific
designs. Additionally, we propose scale-adaptive encoding and decoding to
balance efficiency and visual detail, along with a generative de-tokenizer to
produce high-fidelity visual outputs. To address diverse decoding needs, we
utilize two complementary de-tokenizers: a fast autoregressive model and a
diffusion-based model. To enhance modality fusion, we investigate advanced
pre-training tasks, demonstrating improvements in both visual understanding and
generation. Experiments across benchmarks show that Tar matches or surpasses
existing multimodal LLM methods, achieving faster convergence and greater
training efficiency. Code, models, and data are available at
https://tar.csuhan.com
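
Below is a minimal sketch, not the authors' implementation, of the two ideas the abstract names: (1) a visual codebook projected from the LLM's token embeddings, so image patches quantize to discrete, text-aligned tokens, and (2) scale-adaptive encoding that pools patch features to a coarser grid to trade visual detail for fewer tokens. All module names, dimensions, and the nearest-neighbor quantizer are illustrative assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAlignedTokenizerSketch(nn.Module):
    """Hypothetical TA-Tok-style tokenizer: quantize vision features against
    a codebook projected from LLM token embeddings (all sizes assumed)."""

    def __init__(self, llm_embeddings: torch.Tensor, vision_dim: int, code_dim: int):
        super().__init__()
        self.register_buffer("llm_embeddings", llm_embeddings)        # (V, llm_dim)
        # Project (a subset of) the LLM vocabulary embeddings into codebook space.
        self.codebook_proj = nn.Linear(llm_embeddings.shape[1], code_dim)
        # Project vision-encoder patch features into the same space.
        self.vision_proj = nn.Linear(vision_dim, code_dim)

    def forward(self, patch_features: torch.Tensor, grid_size: int) -> torch.Tensor:
        # patch_features: (B, H, W, vision_dim) from a vision encoder.
        B, H, W, D = patch_features.shape
        # Scale-adaptive encoding: average-pool the patch grid down to
        # grid_size x grid_size, so coarser scales emit fewer tokens.
        x = patch_features.permute(0, 3, 1, 2)                        # (B, D, H, W)
        x = F.adaptive_avg_pool2d(x, grid_size)                       # (B, D, g, g)
        x = x.flatten(2).transpose(1, 2)                              # (B, g*g, D)
        # Quantize against the text-aligned codebook via nearest neighbor.
        codebook = self.codebook_proj(self.llm_embeddings)            # (V, code_dim)
        z = self.vision_proj(x)                                       # (B, g*g, code_dim)
        dists = torch.cdist(z, codebook.unsqueeze(0).expand(B, -1, -1))
        return dists.argmin(dim=-1)                                   # (B, g*g) token ids


# Usage with hypothetical sizes: an 8,192-entry LLM vocabulary with 2,048-d
# embeddings and a 24x24 grid of 1,024-d patch features. grid_size controls
# the efficiency/detail trade-off (e.g. 8 -> 64 tokens, 24 -> 576 tokens).
llm_emb = torch.randn(8192, 2048)
tok = TextAlignedTokenizerSketch(llm_emb, vision_dim=1024, code_dim=256)
ids = tok(torch.randn(2, 24, 24, 1024), grid_size=8)
print(ids.shape)  # torch.Size([2, 64])
```

Because the resulting ids index a codebook derived from the LLM vocabulary, they can in principle be appended to the LLM's expanded token space for cross-modal input and output; the generative de-tokenizers (autoregressive or diffusion-based) described in the abstract would then map such token grids back to pixels.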