InsightTok: Improving Text and Face Fidelity in Discrete Tokenization for Autoregressive Image Generation

Abstract

Text and faces are among the most perceptually salient and practically important patterns in visual generation, yet they remain challenging for autoregressive generators built on discrete tokenization. A central bottleneck is the tokenizer: aggressive downsampling and quantization often discard the fine-grained structures needed to preserve readable glyphs and distinctive facial features. We attribute this gap to standard discrete-tokenizer objectives being weakly aligned with text legibility and facial fidelity, as these objectives typically optimize generic reconstruction while compressing diverse content uniformly. To address this, we propose InsightTok, a simple yet effective discrete visual tokenization framework that enhances text and face fidelity through localized, content-aware perceptual losses. With a compact 16k codebook and a 16x downsampling rate, InsightTok significantly outperforms prior tokenizers in text and face reconstruction without compromising general reconstruction quality. These gains consistently transfer to autoregressive image generation in InsightAR, producing images with clearer text and more faithful facial details. Overall, our results highlight the potential of specialized supervision in tokenizer training for advancing discrete image generation.
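To make the compression bottleneck concrete, the following is a minimal arithmetic sketch of the token budget implied by a 16k codebook and a 16x downsampling rate. It is not taken from the paper: the function name `token_budget`, the 256x256 input resolution, the reading of "16k" as 16,384 entries, and the 8-bit RGB input are all assumptions made for illustration.

```python
import math


def token_budget(height, width, downsample=16, codebook_size=16384):
    """Return token-grid statistics for a discrete tokenizer.

    Illustrative only: `downsample=16` and `codebook_size=16384` are one
    reading of "16x downsampling" and "16k codebook"; the input is assumed
    to be 8-bit RGB.
    """
    if height % downsample or width % downsample:
        raise ValueError("image sides must be divisible by the downsampling rate")

    grid_h, grid_w = height // downsample, width // downsample
    num_tokens = grid_h * grid_w

    # Each token is an index into the codebook, so it carries log2(K) bits.
    bits_per_token = math.log2(codebook_size)

    # Raw bits in the pixel patch that a single token stands for
    # (downsample x downsample pixels, 3 channels, 8 bits each).
    raw_bits_per_patch = downsample * downsample * 3 * 8

    return {
        "grid": (grid_h, grid_w),
        "num_tokens": num_tokens,
        "bits_per_token": bits_per_token,
        "raw_bits_per_patch": raw_bits_per_patch,
        "compression_ratio": raw_bits_per_patch / bits_per_token,
    }


# Assumed example resolution; the abstract does not state one.
stats = token_budget(256, 256)
print(stats)
```

Under these assumptions, a 256x256 image becomes a 16x16 grid of 256 tokens, and each 16x16 pixel patch (6,144 raw bits) is summarized by a single 14-bit code. A small glyph or facial feature can therefore fall entirely within one or two tokens, which is consistent with the abstract's point that generic, uniformly applied reconstruction objectives tend to lose such details.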