InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
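To make the compression setting concrete, the sketch below works out the token budget implied by a 16k codebook and 16x downsampling. It is an illustration only, not the authors' code. It assumes that "16k" means 16,384 entries, that the 16x factor applies to each spatial side, and that the input is an 8-bit RGB image; the function name `token_budget` and the 256x256 resolution are hypothetical choices.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16_384):
    """Estimate the size of a discrete token grid for one image.

    Assumptions (not stated in the abstract): the downsampling factor
    applies per spatial side, "16k" means 16,384 codebook entries, and
    the raw image is 8-bit RGB.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling factor")

    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w

    # One index into the codebook carries log2(codebook_size) bits.
    bits_per_token = math.log2(codebook_size)
    token_bits = num_tokens * bits_per_token

    # Raw size: 3 colour channels at 8 bits each.
    raw_bits = height * width * 3 * 8

    return {
        "grid": (grid_h, grid_w),
        "num_tokens": num_tokens,
        "pixels_per_token": downsample * downsample,
        "bits_per_token": bits_per_token,
        "compression_ratio": raw_bits / token_bits,
    }


if __name__ == "__main__":
    budget = token_budget(256, 256)
    for key, value in budget.items():
        print(f"{key}: {value}")
```

Under these assumptions, each token has to summarize a 16x16 pixel patch with a single 14-bit index, so a small glyph or an eye region can fall entirely within one or two tokens. This is consistent with the abstract's point that aggressive downsampling and quantization discard the fine structure that text and faces depend on.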