TokBench: Evaluating Your Visual Tokenizer before Visual Generation
May 23, 2025
Authors: Junfeng Wu, Dongliang Luo, Weizhi Zhao, Zhihao Xie, Yuanhao Wang, Junyi Li, Xudong Xie, Yuliang Liu, Xiang Bai
cs.AI
Abstract
In this work, we reveal the limitations of visual tokenizers and VAEs in
preserving fine-grained features, and propose a benchmark to evaluate
reconstruction performance for two challenging visual contents: text and face.
Visual tokenizers and VAEs have significantly advanced visual generation and
multimodal modeling by providing more efficient compressed or quantized image
representations. However, while these representations help production models reduce
computational burdens, the information loss from image compression fundamentally limits the
upper bound of visual generation quality. To evaluate this upper bound, we
focus on assessing reconstructed text and facial features since they typically:
1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to
collapse, and 4) are features to which human vision is highly sensitive. We first collect and
curate a diverse set of clear text and face images from existing datasets.
Unlike approaches that rely on vision-language models (VLMs), we employ established OCR and face
recognition models for evaluation, ensuring accuracy while maintaining an
exceptionally lightweight assessment process requiring just 2 GB of memory and
4 minutes to complete. Using our benchmark, we analyze text and face reconstruction quality
across various scales for different image tokenizers and VAEs. Our results show
modern visual tokenizers still struggle to preserve fine-grained features,
especially at smaller scales. We further extend this evaluation framework to
video, conducting a comprehensive analysis of video tokenizers. Additionally, we
demonstrate that traditional metrics fail to accurately reflect reconstruction
performance for faces and text, while our proposed metrics serve as an
effective complement.
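
To make the evaluation idea concrete, the sketch below shows one way to score a tokenizer's reconstructions with off-the-shelf OCR and face-recognition models, as the abstract describes. It is a minimal illustration, not the authors' pipeline: `easyocr` and `facenet_pytorch` are stand-ins chosen here for convenience, since the abstract does not name the specific models TokBench uses.

```python
# Sketch: score tokenizer reconstructions with an OCR model (text crops)
# and a face-recognition embedding model (face crops). Model choices here
# are illustrative stand-ins, not TokBench's actual components.
from difflib import SequenceMatcher

import easyocr                                  # pip install easyocr
import numpy as np
import torch
from facenet_pytorch import InceptionResnetV1   # pip install facenet-pytorch

ocr_reader = easyocr.Reader(["en"], gpu=False)
face_model = InceptionResnetV1(pretrained="vggface2").eval()


def text_score(gt_text: str, recon_image: np.ndarray) -> float:
    """String similarity between ground-truth text and OCR output
    on the reconstructed crop (1.0 = text perfectly preserved)."""
    pred = " ".join(ocr_reader.readtext(recon_image, detail=0))
    return SequenceMatcher(None, gt_text.lower(), pred.lower()).ratio()


def face_score(orig_face: torch.Tensor, recon_face: torch.Tensor) -> float:
    """Cosine similarity of face-recognition embeddings for the original
    and reconstructed face crops (both 3x160x160 tensors in [-1, 1])."""
    with torch.no_grad():
        emb = face_model(torch.stack([orig_face, recon_face]))
    emb = torch.nn.functional.normalize(emb, dim=1)
    return float((emb[0] * emb[1]).sum())
```

A full benchmark run would apply these scores to every text and face crop at each evaluation scale and average them, yielding recognition-based metrics that complement pixel-level measures such as PSNR or SSIM, which the paper argues fail to reflect fine-grained reconstruction quality.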