Image Tokenizer Needs Post-Training

Abstract

Recent image generative models typically capture the image distribution in a pre-constructed latent space, relying on a frozen image tokenizer. However, there is a significant discrepancy between the reconstruction and generation distributions: current tokenizers prioritize only the reconstruction task, which happens before generative training, and do not account for the generation errors that arise during sampling. In this paper, we comprehensively analyze the reason for this discrepancy in a discrete latent space and, based on this analysis, propose a novel tokenizer training scheme comprising main training and post-training, which focus on improving latent space construction and decoding, respectively. During main training, a latent perturbation strategy is proposed to simulate sampling noise, i.e., the unexpected tokens generated during generative inference. Specifically, we propose a plug-and-play tokenizer training scheme that significantly enhances the robustness of the tokenizer, thereby improving generation quality and convergence speed, and a novel tokenizer evaluation metric, pFID, which successfully correlates tokenizer performance with generation quality. During post-training, we further optimize the tokenizer decoder with respect to a well-trained generative model to mitigate the distribution difference between generated and reconstructed tokens. With a ~400M generator, a discrete tokenizer trained with our proposed main training achieves a notable 1.60 gFID and further obtains 1.36 gFID with the additional post-training. Further experiments broadly validate the effectiveness of our post-training strategy on off-the-shelf discrete and continuous tokenizers, coupled with autoregressive and diffusion-based generators.
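To make the idea of simulating "unexpected tokens" concrete, the following is a minimal sketch of one possible latent perturbation on a sequence of discrete token indices. The abstract does not specify how tokens are perturbed, so everything here is an assumption for illustration: the function name `perturb_tokens`, the `rate` parameter, and the choice of uniform random replacement are hypothetical and may differ from the scheme actually used in the paper.

```python
import random


def perturb_tokens(tokens, codebook_size, rate, rng=None):
    """Return a copy of `tokens` with round(rate * len(tokens)) positions
    replaced by a different, uniformly drawn codebook index.

    Assumption: uniform replacement is only a stand-in for sampling noise;
    the paper's actual perturbation rule is not given in the abstract.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    if codebook_size < 2:
        raise ValueError("codebook_size must be at least 2")
    rng = rng or random.Random()

    n_swap = round(rate * len(tokens))
    positions = rng.sample(range(len(tokens)), n_swap)

    out = list(tokens)
    for pos in positions:
        # Draw from the other codebook_size - 1 indices so the token always changes.
        new = rng.randrange(codebook_size - 1)
        if new >= out[pos]:
            new += 1
        out[pos] = new
    return out


tokens = [3, 7, 7, 1, 0, 5, 2, 6]
noisy = perturb_tokens(tokens, codebook_size=8, rate=0.25, rng=random.Random(0))
changed = sum(a != b for a, b in zip(tokens, noisy))
print(changed)  # 2 of the 8 positions differ
```

In a training loop, a decoder would be asked to reconstruct the image from `noisy` rather than `tokens`, which is one way a tokenizer could be exposed to the kind of errors a generator makes at sampling time.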