ε-VAE: Denoising as Visual Decoding
October 5, 2024
作者: Long Zhao, Sanghyun Woo, Ziyu Wan, Yandong Li, Han Zhang, Boqing Gong, Hartwig Adam, Xuhui Jia, Ting Liu
cs.AI
Abstract
In generative modeling, tokenization simplifies complex data into compact,
structured representations, creating a more efficient, learnable space. For
high-dimensional visual data, it reduces redundancy and emphasizes key features
for high-quality generation. Current visual tokenization methods rely on a
traditional autoencoder framework, where the encoder compresses data into
latent representations, and the decoder reconstructs the original input. In
this work, we offer a new perspective by proposing denoising as decoding,
shifting from single-step reconstruction to iterative refinement. Specifically,
we replace the decoder with a diffusion process that iteratively refines noise
to recover the original image, guided by the latents provided by the encoder.
We evaluate our approach by assessing both reconstruction (rFID) and generation
quality (FID), comparing it to state-of-the-art autoencoding approaches. We hope
this work offers new insights into integrating iterative generation and
autoencoding for improved compression and generation.
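To make the "denoising as decoding" idea concrete, the sketch below shows one plausible way to wire an encoder to a conditional diffusion decoder: the encoder compresses an image into latents, and decoding runs a DDPM-style reverse chain that iteratively refines pure noise while conditioning on those latents. This is a minimal illustration under assumed choices (a linear noise schedule, toy convolutional networks, and the hypothetical names `Encoder`, `ConditionalDenoiser`, and `decode_by_denoising`), not the authors' actual architecture or training setup.

```python
# Minimal sketch of denoising-as-decoding (hypothetical toy networks, not the paper's code).
import torch
import torch.nn as nn

T = 1000                                          # number of diffusion steps (assumed)
betas = torch.linspace(1e-4, 0.02, T)             # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)         # cumulative product \bar{alpha}_t

class Encoder(nn.Module):
    """Compresses an image into a compact latent map (stand-in architecture)."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(128, latent_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class ConditionalDenoiser(nn.Module):
    """Predicts the noise in x_t, conditioned on the encoder latent z (toy network)."""
    def __init__(self, latent_ch=4):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + latent_ch + 1, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )
    def forward(self, x_t, t, z):
        # Upsample the latent to image resolution and broadcast the timestep as a channel.
        z_up = nn.functional.interpolate(z, size=x_t.shape[-2:], mode="nearest")
        t_map = (t.float() / T).view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[-2:])
        return self.net(torch.cat([x_t, z_up, t_map], dim=1))

@torch.no_grad()
def decode_by_denoising(denoiser, z, shape):
    """Iteratively refine pure noise into an image, guided by the latent z (DDPM sampling)."""
    x_t = torch.randn(shape)
    for t in reversed(range(T)):
        t_batch = torch.full((shape[0],), t, dtype=torch.long)
        eps = denoiser(x_t, t_batch, z)            # predicted noise at step t
        a_t, ab_t = alphas[t], alpha_bars[t]
        mean = (x_t - (1 - a_t) / (1 - ab_t).sqrt() * eps) / a_t.sqrt()
        x_t = mean + betas[t].sqrt() * torch.randn_like(x_t) if t > 0 else mean
    return x_t

# Usage: encode an image to latents, then "decode" by running the reverse diffusion chain.
enc, dec = Encoder(), ConditionalDenoiser()
x = torch.randn(1, 3, 64, 64)                      # placeholder image batch
z = enc(x)                                         # compact latent representation
x_rec = decode_by_denoising(dec, z, x.shape)       # iterative refinement replaces one-shot decoding
```

The design point the sketch illustrates is the contrast with a standard autoencoder: instead of a single forward pass from latents to pixels, reconstruction is spread over many small denoising steps, each of which consults the same encoder latents.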