

Latent Denoising Makes Good Visual Tokenizers

July 21, 2025
Authors: Jiawei Yang, Tianhong Li, Lijie Fan, Yonglong Tian, Yue Wang
cs.AI

Abstract

Despite their fundamental role in generative modeling, it remains unclear what properties make visual tokenizers more effective. We observe that modern generative models share a conceptually similar training objective -- reconstructing clean signals from corrupted inputs such as Gaussian noise or masking -- a process we term denoising. Motivated by this insight, we propose aligning tokenizer embeddings directly with the downstream denoising objective, encouraging latent embeddings to be easily reconstructed even when heavily corrupted. To achieve this, we introduce the Latent Denoising Tokenizer (l-DeTok), a simple yet effective tokenizer trained to reconstruct clean images from latent embeddings corrupted by interpolative noise and random masking. Extensive experiments on ImageNet 256x256 demonstrate that our tokenizer consistently outperforms standard tokenizers across six representative generative models. Our findings highlight denoising as a fundamental design principle for tokenizer development, and we hope it motivates new perspectives for future tokenizer design.
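The latent corruption described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name, the exact interpolative-noise formula ((1 - tau) * z + tau * noise), and the use of zeros as a stand-in for a learned mask token are all assumptions for illustration.

```python
import numpy as np

def corrupt_latents(z, tau, mask_ratio, rng):
    """Corrupt tokenizer latents in the spirit of l-DeTok (illustrative sketch).

    z          : (num_tokens, dim) latent embeddings from the tokenizer encoder
    tau        : interpolation strength toward Gaussian noise, in [0, 1]
    mask_ratio : fraction of tokens to mask out
    """
    # Interpolative noising: mix each latent token with Gaussian noise.
    noise = rng.standard_normal(z.shape)
    z_corrupt = (1.0 - tau) * z + tau * noise

    # Random masking: replace a random subset of tokens entirely.
    # A real tokenizer would use a learned mask token; zeros stand in here.
    num_masked = int(mask_ratio * z.shape[0])
    masked_idx = rng.permutation(z.shape[0])[:num_masked]
    z_corrupt[masked_idx] = 0.0

    return z_corrupt

rng = np.random.default_rng(0)
z = rng.standard_normal((16, 8))           # 16 latent tokens of dim 8
z_corrupt = corrupt_latents(z, tau=0.7, mask_ratio=0.5, rng=rng)
```

The tokenizer decoder would then be trained to reconstruct the clean image from `z_corrupt`, which pushes the latent space toward embeddings that remain decodable under heavy corruption, matching the downstream denoising objective of the generative model.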