TokBench: Evaluating Your Visual Tokenizer before Visual Generation
May 23, 2025
Authors: Junfeng Wu, Dongliang Luo, Weizhi Zhao, Zhihao Xie, Yuanhao Wang, Junyi Li, Xudong Xie, Yuliang Liu, Xiang Bai
cs.AI
Abstract
In this work, we reveal the limitations of visual tokenizers and VAEs in
preserving fine-grained features, and propose a benchmark to evaluate
reconstruction performance for two challenging visual contents: text and face.
Visual tokenizers and VAEs have significantly advanced visual generation and
multimodal modeling by providing more efficient compressed or quantized image
representations. However, while these representations lighten the computational
burden of generative models, the information lost to compression fundamentally
limits the upper bound of visual generation quality. To evaluate this upper
bound, we focus on assessing reconstructed text and facial features, since they
typically: 1) exist at small scales, 2) contain dense, rich textures, 3) are
prone to collapse during reconstruction, and 4) are content to which human
vision is highly sensitive. We first collect and
curate a diverse set of clear text and face images from existing datasets.
Unlike approaches that rely on vision-language models (VLMs), we employ
established OCR and face recognition models for evaluation, ensuring accuracy
while keeping the assessment process exceptionally lightweight: it requires
just 2 GB of memory and 4 minutes to complete. Using our benchmark, we analyze
text and face reconstruction quality
across various scales for different image tokenizers and VAEs. Our results show
modern visual tokenizers still struggle to preserve fine-grained features,
especially at smaller scales. We further extend this evaluation framework to
video, conducting a comprehensive analysis of video tokenizers. Additionally, we
demonstrate that traditional metrics fail to accurately reflect reconstruction
performance for faces and text, while our proposed metrics serve as an
effective complement.
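To make the recognizer-based evaluation concrete, below is a minimal sketch of the text side of the idea in Python. It is not the authors' released code: the VAE checkpoint (`stabilityai/sd-vae-ft-mse` via `diffusers`) and the `easyocr` dependency are illustrative assumptions standing in for TokBench's own curated data and recognition models. The sketch round-trips an image through a VAE and scores how well the text survives by comparing OCR transcripts before and after reconstruction.

```python
import difflib

import numpy as np
import torch
import easyocr
from PIL import Image
from diffusers import AutoencoderKL
from torchvision.transforms.functional import to_pil_image, to_tensor

# Hypothetical stand-ins: any VAE/tokenizer under test and any solid OCR model work.
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse").eval()
reader = easyocr.Reader(["en"], gpu=False)


def reconstruct(img: Image.Image) -> Image.Image:
    """Round-trip an image through the VAE's encode/decode path."""
    # AutoencoderKL downsamples by 8, so crop to a multiple of 8 first.
    w, h = img.size
    img = img.crop((0, 0, w - w % 8, h - h % 8))
    x = to_tensor(img).unsqueeze(0) * 2 - 1  # scale [0, 1] -> [-1, 1]
    with torch.no_grad():
        latents = vae.encode(x).latent_dist.sample()
        recon = vae.decode(latents).sample
    return to_pil_image((recon.squeeze(0).clamp(-1, 1) + 1) / 2)


def ocr_text(img: Image.Image) -> str:
    """Concatenate all text easyocr finds in the image."""
    return " ".join(text for _, text, _ in reader.readtext(np.array(img)))


def text_fidelity(original: Image.Image) -> float:
    """Similarity in [0, 1] between OCR transcripts before/after reconstruction."""
    return difflib.SequenceMatcher(
        None, ocr_text(original), ocr_text(reconstruct(original))
    ).ratio()


if __name__ == "__main__":
    img = Image.open("text_sample.png").convert("RGB")  # any image with small text
    print(f"text reconstruction similarity: {text_fidelity(img):.3f}")
```

The benchmark itself scores recognitions against curated ground-truth annotations rather than against a pre-reconstruction OCR pass; the transcript-similarity ratio above is only meant to show why a recognizer-based score catches degradations that averaged pixel metrics such as PSNR or SSIM can miss.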
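A companion sketch for the face side of the evaluation, again an illustrative assumption rather than the authors' pipeline: embed the original and the reconstruction with a pretrained face recognizer and report the cosine similarity of the embeddings. Here `facenet_pytorch` with `vggface2` weights stands in for whatever recognizer the benchmark actually uses.

```python
import torch
from PIL import Image
from facenet_pytorch import MTCNN, InceptionResnetV1

mtcnn = MTCNN(image_size=160)  # detect and align faces
embedder = InceptionResnetV1(pretrained="vggface2").eval()


def face_fidelity(original: Image.Image, reconstructed: Image.Image) -> float:
    """Cosine similarity of face embeddings before/after reconstruction."""
    faces = [mtcnn(img) for img in (original, reconstructed)]
    if any(f is None for f in faces):
        return 0.0  # face degraded beyond detection
    with torch.no_grad():
        embeddings = embedder(torch.stack(faces))  # shape (2, 512)
    return torch.nn.functional.cosine_similarity(
        embeddings[0], embeddings[1], dim=0
    ).item()
```

Reusing `reconstruct` from the previous sketch, `face_fidelity(img, reconstruct(img))` yields an identity-preservation score for a single image; returning 0.0 when the detector fails reflects the "prone to collapse" failure mode the abstract highlights.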