Image Tokenizer Needs Post-Training
September 15, 2025
Authors: Kai Qiu, Xiang Li, Hao Chen, Jason Kuen, Xiaohao Xu, Jiuxiang Gu, Yinyi Luo, Bhiksha Raj, Zhe Lin, Marios Savvides
cs.AI
Abstract
Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers only prioritize the reconstruction task, which happens before generative training, without considering the generation errors that arise during sampling. In this paper, we comprehensively analyze the causes of this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme comprising both main-training and post-training, which focus on improving latent-space construction and decoding, respectively. During main-training, a latent perturbation strategy is proposed to simulate sampling noise, i.e., the unexpected tokens produced during generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, thereby boosting generation quality and convergence speed, as well as a novel tokenizer evaluation metric, pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main-training scheme achieves a notable 1.60 gFID, and further obtains 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
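The abstract only names the latent perturbation strategy; below is a minimal sketch of one plausible reading, assuming a VQ-style tokenizer whose quantizer emits discrete codebook indices. The helper `perturb_tokens`, the replacement ratio, and the reconstruction loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F


def perturb_tokens(token_ids: torch.Tensor, codebook_size: int, ratio: float = 0.1) -> torch.Tensor:
    """Randomly replace a fraction of discrete token indices with random codebook
    entries, mimicking the unexpected tokens a generator may sample at inference.
    (Illustrative: the ratio and uniform replacement are assumptions.)"""
    mask = torch.rand_like(token_ids, dtype=torch.float) < ratio
    random_ids = torch.randint_like(token_ids, low=0, high=codebook_size)
    return torch.where(mask, random_ids, token_ids)


def training_step(encoder, quantizer, decoder, images, codebook_size, ratio=0.1):
    """Hypothetical tokenizer training step: the decoder reconstructs the image
    from perturbed codes, so it learns to stay robust to sampling noise
    introduced later by a downstream generator."""
    token_ids = quantizer(encoder(images))           # discrete latent codes
    noisy_ids = perturb_tokens(token_ids, codebook_size, ratio)
    recon = decoder(noisy_ids)                       # decode the perturbed codes
    return F.mse_loss(recon, images)                 # reconstruction objective
```

Conceptually, the post-training stage described in the abstract would replace the random corruption with tokens actually produced by a well-trained generator, fine-tuning only the decoder on that generated distribution.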