Latent Denoising Makes Good Visual Tokenizers

Abstract

Despite the fundamental role of visual tokenizers, it remains unclear which properties make them more effective for generative modeling. We observe that modern generative models share a conceptually similar training objective: reconstructing clean signals from corrupted inputs, such as inputs perturbed by Gaussian noise or masking, a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to remain easy to reconstruct even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope they motivate new perspectives on future tokenizer design.
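To make the two corruptions named in the abstract concrete, the following is a minimal sketch of how latent embeddings might be corrupted before decoding. It is not the authors' implementation. The abstract does not give the formulas, so the convex blend `(1 - tau) * x + tau * noise`, the zero vector used as a mask placeholder, and all names (`corrupt_latents`, `tau`, `mask_ratio`, `noise_std`) are assumptions made for illustration.

```python
import random


def corrupt_latents(latents, tau, mask_ratio, noise_std=1.0, seed=0):
    """Corrupt a list of latent tokens (each a list of floats).

    Assumed, hypothetical formulation:
      * interpolative noise: x' = (1 - tau) * x + tau * eps, eps ~ N(0, noise_std^2)
      * random masking: a random subset of tokens is replaced by zeros
        (a real tokenizer would more likely use a learned mask token).
    """
    rng = random.Random(seed)
    num_tokens = len(latents)

    # Blend every latent value with Gaussian noise instead of adding noise on top.
    noised = [
        [(1.0 - tau) * value + tau * rng.gauss(0.0, noise_std) for value in token]
        for token in latents
    ]

    # Pick which token positions to mask, without replacement.
    num_masked = round(mask_ratio * num_tokens)
    masked = set(rng.sample(range(num_tokens), num_masked))

    corrupted = [
        [0.0] * len(token) if index in masked else token
        for index, token in enumerate(noised)
    ]
    return corrupted, sorted(masked)


if __name__ == "__main__":
    latents = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    corrupted, masked = corrupt_latents(latents, tau=0.3, mask_ratio=0.5)
    print("masked positions:", masked)
    print("corrupted latents:", corrupted)
```

In this sketch, `tau = 0` leaves unmasked tokens untouched and `tau = 1` replaces them entirely with noise, which is what distinguishes an interpolative blend from purely additive noise. During tokenizer training, a decoder would then be asked to reconstruct the clean image from `corrupted`.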
