
RecTok: Reconstruction Distillation along Rectified Flow

December 15, 2025
作者: Qingyu Shi, Size Wu, Jinbin Bai, Kaidong Yu, Yujing Wang, Yunhai Tong, Xiangtai Li, Xuelong Li
cs.AI

Abstract

Visual tokenizers play a crucial role in diffusion models. The dimensionality of the latent space governs both reconstruction fidelity and the semantic expressiveness of latent features. However, a fundamental trade-off between dimensionality and generation quality constrains existing methods to low-dimensional latent spaces. Although recent works have leveraged vision foundation models (VFMs) to enrich the semantics of visual tokenizers and accelerate convergence, high-dimensional tokenizers still underperform their low-dimensional counterparts. In this work, we propose RecTok, which overcomes the limitations of high-dimensional visual tokenizers through two key innovations: flow semantic distillation and reconstruction-alignment distillation. Our key insight is to make the forward flow in flow matching, which serves as the training space of diffusion transformers, semantically rich, rather than focusing on the latent space as in previous works. Specifically, our method distills the semantic information of VFMs into the forward flow trajectories of flow matching, and further enhances the semantics with a masked feature reconstruction loss. RecTok achieves superior image reconstruction, generation quality, and discriminative performance: it attains state-of-the-art gFID-50K results both with and without classifier-free guidance, while maintaining a semantically rich latent-space structure. Furthermore, we observe consistent improvements as the latent dimensionality increases. Code and models are available at https://shi-qingyu.github.io/rectok.github.io.
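
The abstract names the two objectives but gives no formulas. Below is a minimal PyTorch sketch of how they might be instantiated: the rectified-flow interpolation x_t = (1 - t) x_0 + t x_1 is the standard forward flow, while the projection head `proj`, the mask ratio, and the cosine/MSE loss choices are illustrative assumptions, not the authors' released implementation.

```python
# Hedged sketch of RecTok-style training losses (illustrative, not official code).
import torch
import torch.nn.functional as F

def flow_semantic_distillation_loss(z1, vfm_feat, proj):
    """Align a projection of a random point on the rectified-flow forward
    trajectory with frozen VFM features of the same image.

    z1:       clean tokenizer latents,            (B, N, C)
    vfm_feat: frozen VFM features (e.g. DINO),    (B, N, D)
    proj:     trainable head mapping C -> D (assumed architecture)
    """
    z0 = torch.randn_like(z1)                            # noise endpoint
    t = torch.rand(z1.size(0), 1, 1, device=z1.device)   # per-sample timestep
    zt = (1.0 - t) * z0 + t * z1                         # forward-flow sample
    pred = proj(zt)                                      # (B, N, D)
    # Cosine alignment so semantics persist along the whole trajectory.
    return 1.0 - F.cosine_similarity(pred, vfm_feat, dim=-1).mean()

def masked_feature_reconstruction_loss(z1, vfm_feat, proj, mask_ratio=0.5):
    """Zero out a random subset of latent tokens and reconstruct the
    corresponding VFM features (hypothetical formulation of the masked loss)."""
    B, N, _ = z1.shape
    mask = torch.rand(B, N, 1, device=z1.device) < mask_ratio
    pred = proj(z1.masked_fill(mask, 0.0))               # predict from visible tokens
    m = mask.expand_as(vfm_feat)
    # Supervise only the masked positions.
    return F.mse_loss(pred[m], vfm_feat[m])

# Example wiring (shapes only): 32-dim latents, 768-dim VFM features.
proj = torch.nn.Linear(32, 768)
z1 = torch.randn(4, 196, 32)     # tokenizer latents for 4 images
vfm = torch.randn(4, 196, 768)   # frozen VFM features for the same images
loss = (flow_semantic_distillation_loss(z1, vfm, proj)
        + masked_feature_reconstruction_loss(z1, vfm, proj))
```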