TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders

Abstract

We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the number of latent channels to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the ViT patch size under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, which reduces structural information loss and enables effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generation-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
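
The token-scaling argument can be made concrete with a little arithmetic. The sketch below is an illustration only, not the authors' code: the 256x256 image size, the 32x spatial downsampling of the latent grid, and the patch sizes are assumed values chosen for the example, and `token_budget` is a hypothetical helper. It shows that, with the latent grid held fixed, a smaller patch size yields more tokens and therefore a larger number of tokens that must be compressed into each latent position.

```python
def token_budget(image_size: int, patch_size: int, latent_downsample: int) -> dict:
    """Count ViT tokens and latent positions for a square image.

    All arguments are illustrative assumptions, not values from the paper.
    """
    if image_size % patch_size or image_size % latent_downsample:
        raise ValueError("image_size must be divisible by patch_size and latent_downsample")
    tokens = (image_size // patch_size) ** 2          # one token per patch
    latents = (image_size // latent_downsample) ** 2  # fixed latent grid
    return {
        "patch_size": patch_size,
        "tokens": tokens,
        "latents": latents,
        # How many tokens must be squeezed into one latent position.
        "tokens_per_latent": tokens / latents,
    }


# Hypothetical setting: 256x256 image, 32x downsampling (an 8x8 latent grid).
rows = [token_budget(256, p, 32) for p in (32, 16, 8)]
for row in rows:
    print(row)
```

In this assumed setting, halving the patch size quadruples the token count while the latent grid stays at 64 positions, so the token-to-latent ratio grows from 1 to 4 to 16. The abstract does not specify how the two compression stages divide this ratio, so the sketch stops at the single-step figure.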
