TokBench: Evaluating Your Visual Tokenizer before Visual Generation

Abstract

In this work, we reveal the limitations of visual tokenizers and VAEs (variational autoencoders) in preserving fine-grained features, and propose a benchmark to evaluate reconstruction performance on two challenging types of visual content: text and faces. Visual tokenizers and VAEs have significantly advanced visual generation and multimodal modeling by providing more efficient compressed or quantized image representations. However, while this compression reduces the computational burden on generative models, the information loss it causes fundamentally limits the upper bound of visual generation quality. To evaluate this upper bound, we focus on assessing reconstructed text and facial features, since they typically: 1) exist at smaller scales, 2) contain dense and rich textures, 3) are prone to collapse, and 4) are features to which human vision is highly sensitive. We first collect and curate a diverse set of clear text and face images from existing datasets. Unlike approaches that rely on VLMs, we employ established OCR and face recognition models for evaluation, ensuring accuracy while keeping the assessment exceptionally lightweight: it requires just 2 GB of memory and 4 minutes to complete. Using our benchmark, we analyze text and face reconstruction quality across various scales for different image tokenizers and VAEs. Our results show that modern visual tokenizers still struggle to preserve fine-grained features, especially at smaller scales. We further extend this evaluation framework to video and conduct a comprehensive analysis of video tokenizers. Additionally, we demonstrate that traditional metrics fail to accurately reflect reconstruction performance for faces and text, while our proposed metrics serve as an effective complement.
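As a rough illustration of how recognition-based scoring of this kind can work, the sketch below compares OCR output on reconstructed text crops against ground-truth strings, and compares face embeddings of an original and a reconstructed crop. The function names, the use of exact-match accuracy and normalized edit similarity for text, and the use of cosine similarity for faces are assumptions made for illustration; the abstract does not specify the benchmark's actual formulas. The OCR strings and embeddings are presumed to come from external models and are passed in as plain Python values.

```python
import math


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling one-row DP table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def text_scores(ground_truth, recognized):
    """Return (exact-match accuracy, mean normalized edit similarity).

    `recognized` holds hypothetical OCR outputs on reconstructed crops,
    aligned one-to-one with `ground_truth`.
    """
    if not ground_truth or len(ground_truth) != len(recognized):
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(ground_truth)
    exact = sum(g == r for g, r in zip(ground_truth, recognized)) / n
    sims = []
    for g, r in zip(ground_truth, recognized):
        longest = max(len(g), len(r))
        sims.append(1.0 if longest == 0 else 1.0 - edit_distance(g, r) / longest)
    return exact, sum(sims) / n


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    if norm == 0:
        raise ValueError("zero-norm vector")
    return dot / norm
```

Under these assumptions, a tokenizer that blurs small text would lower both text scores, and one that distorts identity would pull the face similarity toward zero, even when pixel-level metrics change little.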
