

Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

June 23, 2025
作者: Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang
cs.AI

Abstract

This paper presents a multimodal framework that unifies visual understanding and generation within a shared discrete semantic representation. At its core is the Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens using a text-aligned codebook projected from a large language model's (LLM) vocabulary. By integrating vision and text into a unified space with an expanded vocabulary, our multimodal LLM, Tar, enables cross-modal input and output through a shared interface, without the need for modality-specific designs. Additionally, we propose scale-adaptive encoding and decoding to balance efficiency and visual detail, along with a generative de-tokenizer to produce high-fidelity visual outputs. To address diverse decoding needs, we utilize two complementary de-tokenizers: a fast autoregressive model and a diffusion-based model. To enhance modality fusion, we investigate advanced pre-training tasks, demonstrating improvements in both visual understanding and generation. Experiments across benchmarks show that Tar matches or surpasses existing multimodal LLM methods, achieving faster convergence and greater training efficiency. Code, models, and data are available at https://tar.csuhan.com.
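As a rough illustration of the text-aligned tokenization idea in the abstract, the sketch below projects an LLM's word-embedding table into a shared space to form a codebook and quantizes vision-encoder features against it by nearest neighbor, so each image patch maps to an index in the LLM vocabulary. The class name, dimensions, single linear projections, and quantization rule are assumptions for illustration only, not the paper's actual TA-Tok design.

```python
# Minimal sketch (PyTorch) of a text-aligned visual tokenizer.
# Assumptions: one linear projection of the LLM embedding table forms the
# codebook, and patch features are quantized by nearest-neighbor lookup.
import torch
import torch.nn as nn


class TextAlignedTokenizer(nn.Module):
    def __init__(self, llm_embed: torch.Tensor, vision_dim: int = 1024, codebook_dim: int = 256):
        super().__init__()
        # Frozen LLM word embeddings, shape [vocab_size, llm_dim].
        self.register_buffer("llm_embed", llm_embed)
        # Project text embeddings and vision features into a shared space.
        self.text_proj = nn.Linear(llm_embed.shape[1], codebook_dim)
        self.vision_proj = nn.Linear(vision_dim, codebook_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        """vision_feats: [batch, num_patches, vision_dim] -> token ids [batch, num_patches]."""
        codebook = self.text_proj(self.llm_embed)          # [vocab_size, codebook_dim]
        queries = self.vision_proj(vision_feats)           # [B, N, codebook_dim]
        flat = queries.reshape(-1, queries.size(-1))       # [B*N, codebook_dim]
        dists = torch.cdist(flat, codebook)                # [B*N, vocab_size]
        # Each patch is assigned the index of its nearest text-aligned code.
        return dists.argmin(dim=-1).reshape(queries.shape[:-1])


if __name__ == "__main__":
    # Toy usage: a 32k-entry vocabulary and 196 image patches per image.
    llm_embed = torch.randn(32000, 4096)
    tokenizer = TextAlignedTokenizer(llm_embed)
    patch_feats = torch.randn(2, 196, 1024)
    print(tokenizer(patch_feats).shape)  # torch.Size([2, 196])
```

Because the resulting token ids live in (an expansion of) the LLM's own vocabulary, a single model can, in principle, consume and emit both text and visual tokens through one shared interface, which is the unification the abstract describes.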