

TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

April 8, 2026
Authors: Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang
cs.AI

Abstract

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the number of latent channels to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size of the ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor limiting effective scaling. To address this, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, yielding latents that are more amenable to generation. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
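To make the two-stage idea concrete, below is a minimal PyTorch sketch of what decomposing token-to-latent compression into two gentler steps might look like: a mild channel reduction followed by a spatial merge of neighboring tokens. All module names, dimensions, and the specific stage split here are illustrative assumptions, not the paper's actual TC-AE design.

```python
# Hypothetical sketch: two-stage token-to-latent compression for a
# ViT-style autoencoder encoder. Dimensions and the stage split are
# assumptions for illustration; the abstract does not specify them.
import torch
import torch.nn as nn


class TwoStageTokenCompressor(nn.Module):
    """Compress a token grid to a latent grid in two gentler steps
    instead of one aggressive token-to-latent projection."""

    def __init__(self, token_dim=384, mid_dim=128, latent_dim=32, merge=2):
        super().__init__()
        self.merge = merge  # spatial merge factor per side (assumed)
        # Stage 1: mild channel compression per token.
        self.stage1 = nn.Linear(token_dim, mid_dim)
        # Stage 2: merge each merge x merge token neighborhood, then
        # project the concatenated features to the final latent width.
        self.stage2 = nn.Linear(mid_dim * merge * merge, latent_dim)

    def forward(self, tokens, grid_h, grid_w):
        # tokens: (B, N, C) with N = grid_h * grid_w
        b, n, _ = tokens.shape
        x = self.stage1(tokens)                 # (B, N, mid_dim)
        x = x.view(b, grid_h, grid_w, -1)
        m = self.merge
        # Group each m x m neighborhood into a single latent position.
        x = x.view(b, grid_h // m, m, grid_w // m, m, -1)
        x = x.permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(b, (grid_h // m) * (grid_w // m), -1)
        return self.stage2(x)                   # (B, N / m^2, latent_dim)


if __name__ == "__main__":
    # Example of the token-number arithmetic: a 256x256 image with
    # patch size 8 gives a 32x32 token grid; merging 2x2 neighborhoods
    # yields a 16x16 latent grid.
    tokens = torch.randn(1, 32 * 32, 384)
    compressor = TwoStageTokenCompressor()
    latents = compressor(tokens, grid_h=32, grid_w=32)
    print(latents.shape)  # torch.Size([1, 256, 32])
```

The point of the split, as the abstract frames it, is that neither stage alone is as lossy as a single projection from many wide tokens to a few narrow latents, so shrinking the patch size (more tokens) no longer forces a proportionally harsher compression step.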