Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations
June 23, 2025
Authors: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
cs.AI
Abstract
This paper presents a multimodal framework that unifies visual understanding
and generation within a shared discrete semantic representation.
At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into
discrete tokens using a text-aligned codebook projected from a large language
model's (LLM) vocabulary. By integrating vision and text into a unified space
with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input
and output through a shared interface, without the need for modality-specific
designs. Additionally, we propose scale-adaptive encoding and decoding to
balance efficiency and visual detail, along with a generative de-tokenizer to
produce high-fidelity visual outputs. To address diverse decoding needs, we
utilize two complementary de-tokenizers: a fast autoregressive model and a
diffusion-based model. To enhance modality fusion, we investigate advanced
pre-training tasks, demonstrating improvements in both visual understanding and
generation. Experiments across benchmarks show that Tar matches or surpasses
existing multimodal LLM methods, achieving faster convergence and greater
training efficiency. Code, models, and data are available at
https://tar.csuhan.com
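
Below is a minimal sketch, not the authors' implementation, of the two ideas the abstract names: (1) a visual codebook projected from the LLM's token embeddings, so image patches quantize to discrete, text-aligned tokens, and (2) scale-adaptive encoding that pools patch features to a coarser grid to trade visual detail for fewer tokens. All module names, dimensions, and the nearest-neighbor quantizer are illustrative assumptions made for this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextAlignedTokenizerSketch(nn.Module):
    """Hypothetical TA-Tok-style tokenizer: quantize vision features against
    a codebook projected from LLM token embeddings (all sizes assumed)."""

    def __init__(self, llm_embeddings: torch.Tensor, vision_dim: int, code_dim: int):
        super().__init__()
        self.register_buffer("llm_embeddings", llm_embeddings)        # (V, llm_dim)
        # Project (a subset of) the LLM vocabulary embeddings into codebook space.
        self.codebook_proj = nn.Linear(llm_embeddings.shape[1], code_dim)
        # Project vision-encoder patch features into the same space.
        self.vision_proj = nn.Linear(vision_dim, code_dim)

    def forward(self, patch_features: torch.Tensor, grid_size: int) -> torch.Tensor:
        # patch_features: (B, H, W, vision_dim) from a vision encoder.
        B, H, W, D = patch_features.shape
        # Scale-adaptive encoding: average-pool the patch grid down to
        # grid_size x grid_size, so coarser scales emit fewer tokens.
        x = patch_features.permute(0, 3, 1, 2)                        # (B, D, H, W)
        x = F.adaptive_avg_pool2d(x, grid_size)                       # (B, D, g, g)
        x = x.flatten(2).transpose(1, 2)                              # (B, g*g, D)
        # Quantize against the text-aligned codebook via nearest neighbor.
        codebook = self.codebook_proj(self.llm_embeddings)            # (V, code_dim)
        z = self.vision_proj(x)                                       # (B, g*g, code_dim)
        dists = torch.cdist(z, codebook.unsqueeze(0).expand(B, -1, -1))
        return dists.argmin(dim=-1)                                   # (B, g*g) token ids


# Usage with hypothetical sizes: an 8,192-entry LLM vocabulary with 2,048-d
# embeddings and a 24x24 grid of 1,024-d patch features. grid_size controls
# the efficiency/detail trade-off (e.g. 8 -> 64 tokens, 24 -> 576 tokens).
llm_emb = torch.randn(8192, 2048)
tok = TextAlignedTokenizerSketch(llm_emb, vision_dim=1024, code_dim=256)
ids = tok(torch.randn(2, 24, 24, 1024), grid_size=8)
print(ids.shape)  # torch.Size([2, 64])
```

Because the resulting ids index a codebook derived from the LLM vocabulary, they can in principle be appended to the LLM's expanded token space for cross-modal input and output; the generative de-tokenizers (autoregressive or diffusion-based) described in the abstract would then map such token grids back to pixels.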