

Image Tokenizer Needs Post-Training

September 15, 2025
Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
cs.AI

Abstract

Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which happens before generative training, and do not account for the generation errors that arise during sampling. In this paper, we comprehensively analyze the cause of this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme comprising both main-training and post-training, which improve latent space construction and decoding respectively. During main-training, a latent perturbation strategy is introduced to simulate sampling noise, i.e., the unexpected tokens generated during generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, thereby improving generation quality and convergence speed, together with a novel tokenizer evaluation metric, pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M-parameter generator, a discrete tokenizer trained with our proposed main-training achieves a notable 1.60 gFID, which further improves to 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
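
To make the latent perturbation idea concrete, the sketch below randomly replaces a fraction of the quantized token indices with random codebook entries before decoding during tokenizer training, so the decoder learns to tolerate the unexpected tokens a generator may later sample. This is a minimal illustration assuming a PyTorch-style discrete tokenizer; the function and variable names (`perturb_tokens`, `codebook_size`, `perturb_ratio`, `quantizer`, `decoder`) are hypothetical and not the authors' released implementation.

```python
# Minimal sketch of a latent perturbation step for discrete tokenizer training.
# Names and framing are illustrative assumptions, not the paper's actual code.
import torch


def perturb_tokens(token_ids: torch.Tensor,
                   codebook_size: int,
                   perturb_ratio: float = 0.1) -> torch.Tensor:
    """Randomly replace a fraction of discrete token ids with random ids.

    token_ids: (B, N) integer tensor of quantized codebook indices.
    perturb_ratio: fraction of positions to overwrite with random tokens,
        mimicking the sampling noise a generator produces at inference time.
    """
    # Boolean mask selecting which positions to perturb.
    mask = torch.rand_like(token_ids, dtype=torch.float) < perturb_ratio
    # Random replacement indices drawn uniformly from the codebook.
    random_ids = torch.randint_like(token_ids, high=codebook_size)
    return torch.where(mask, random_ids, token_ids)


# Hypothetical usage inside a tokenizer training step:
#   ids = quantizer(encoder(images))                  # (B, N) token indices
#   noisy_ids = perturb_tokens(ids, codebook_size=16384, perturb_ratio=0.1)
#   recon = decoder(codebook(noisy_ids))              # decode perturbed tokens
#   loss = reconstruction_loss(recon, images)         # decoder learns robustness
```

In such a setup the perturbation ratio would be a tunable hyperparameter trading reconstruction fidelity against robustness to sampling noise; the paper's post-training stage additionally fine-tunes the decoder on tokens produced by an already-trained generator rather than on synthetic perturbations alone.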