Image Tokenizer Needs Post-Training

Abstract

Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which takes place before generative training, and do not account for the generation errors that arise during sampling. In this paper, we comprehensively analyze the cause of this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme that comprises main training and post-training, which focus on improving latent space construction and decoding, respectively. During main training, we propose a latent perturbation strategy to simulate sampling noise, i.e., the unexpected tokens generated during generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, thereby improving generation quality and convergence speed, and a novel tokenizer evaluation metric, pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID, which improves to 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
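
To make the idea of latent perturbation concrete, the following is a minimal sketch of one way to inject "unexpected tokens" into a sequence of discrete token indices. The abstract does not specify the replacement rule, so everything here is an assumption for illustration: the function name `perturb_tokens`, the `rate` parameter, and the choice to draw replacements uniformly from the other codebook indices are hypothetical and are not taken from the paper.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng=None):
    """Return a copy of `tokens` with each entry replaced with probability `rate`.

    Assumption (not from the paper): a replaced token is drawn uniformly
    from the codebook indices other than the original one.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    if rng is None:
        rng = random.Random()

    perturbed = list(tokens)
    for i, tok in enumerate(perturbed):
        if rng.random() < rate:
            # Sample from the codebook_size - 1 indices that differ from
            # `tok`: draw from a range one shorter, then skip over `tok`.
            new_tok = rng.randrange(codebook_size - 1)
            if new_tok >= tok:
                new_tok += 1
            perturbed[i] = new_tok
    return perturbed


# Hypothetical usage: perturb about half of a short token sequence.
tokens = [3, 7, 7, 1, 0, 5]
noisy = perturb_tokens(tokens, codebook_size=8, rate=0.5, rng=random.Random(0))
```

In a setup like this, the decoder would presumably be trained on the perturbed indices rather than the clean ones, so that it learns to tolerate the kind of token errors a generator makes at sampling time. Uniform replacement is only the simplest possible choice; other rules would fit the same interface.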

