RecTok: Reconstruction Distillation along Rectified Flow
December 15, 2025
Authors: Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li
cs.AI
Abstract
Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of the latent features. However, there is an inherent trade-off between dimensionality and generation quality, which constrains existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information of VFMs into the forward flow trajectories of flow matching, and we further enhance the semantics by introducing a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance. It achieves state-of-the-art gFID-50K results both with and without classifier-free guidance, while maintaining a semantically rich latent space. Furthermore, we observe consistent improvements as the latent dimensionality increases. Code and models are available at https://shi-qingyu.github.io/rectok.github.io.
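The abstract does not specify implementation details, but the two ingredients it names can be sketched in a few lines: the forward flow in rectified flow matching is the straight-line interpolation between noise and a data latent, and the two distillation terms can plausibly be written as a cosine loss against frozen VFM features plus a masked-token reconstruction loss. Everything below is a hypothetical illustration: the shapes, the random stand-ins for tokenizer latents and VFM features, and the exact loss forms are assumptions, not the paper's definitions.

```python
import numpy as np

def rectified_flow_interp(z0, z1, t):
    # Point on the forward trajectory: x_t = (1 - t) * z0 + t * z1
    # (straight-line rectified-flow interpolation; standard formulation).
    return (1.0 - t) * z0 + t * z1

def cosine_distill_loss(student, teacher, eps=1e-8):
    # 1 - mean cosine similarity between per-token feature vectors
    # (assumed form of the flow semantic distillation objective).
    s = student / (np.linalg.norm(student, axis=-1, keepdims=True) + eps)
    u = teacher / (np.linalg.norm(teacher, axis=-1, keepdims=True) + eps)
    return 1.0 - float(np.mean(np.sum(s * u, axis=-1)))

def masked_feature_recon_loss(pred, target, mask, eps=1e-8):
    # MSE computed only over masked tokens (mask == 1); assumed form of
    # the masked feature reconstruction loss.
    sq = (pred - target) ** 2
    return float(np.sum(sq * mask[..., None]) / (np.sum(mask) * pred.shape[-1] + eps))

rng = np.random.default_rng(0)
num_tokens, dim = 4, 16                        # hypothetical token grid / channel dim
z1 = rng.normal(size=(num_tokens, dim))        # tokenizer latents (data endpoint)
z0 = rng.normal(size=(num_tokens, dim))        # Gaussian-noise endpoint
zt = rectified_flow_interp(z0, z1, t=0.3)      # sample on the forward flow
teacher = rng.normal(size=(num_tokens, dim))   # stand-in for frozen VFM features
mask = np.array([1.0, 0.0, 1.0, 0.0])          # half the tokens masked out

distill = cosine_distill_loss(zt, teacher)
recon = masked_feature_recon_loss(zt, teacher, mask)
```

In a real training loop the student features would come from a projection head on the noisy latent `zt` rather than `zt` itself, and the teacher from a frozen VFM such as a DINO-style encoder; here both are random arrays purely to keep the sketch self-contained.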