

Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

February 27, 2026
Authors: Kaiwen Zhu, Quansheng Zeng, Yuandong Pu, Shuo Cao, Xiaohui Li, Yi Xin, Qi Qin, Jiayang Li, Yu Qiao, Jinjin Gu, Yihao Liu
cs.AI

Abstract

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many steps of bi-directional attention they require. In fact, there is notable redundancy in their computation: when discrete tokens are sampled, the rich semantics contained in the continuous features are lost. Some existing works cache features to approximate future features, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose learning a lightweight model that incorporates both previous features and sampled tokens and regresses the average velocity field of the feature evolution. The model has moderate complexity, sufficient to capture the subtle dynamics while remaining lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
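The core mechanism described in the abstract, predicting an average velocity of the feature trajectory from cached features plus sampled-token information and extrapolating several steps ahead, can be sketched as below. This is a minimal illustrative NumPy sketch, not the authors' released implementation: the name `VelocityHead`, the single-linear-layer parameterization, and all dimensions are assumptions chosen for clarity.

```python
import numpy as np

class VelocityHead:
    """Hypothetical lightweight head: maps concatenated [previous features ;
    sampled-token embeddings] to an average velocity of feature evolution."""
    def __init__(self, feat_dim, tok_dim, seed=0):
        rng = np.random.default_rng(seed)
        # A single linear map stands in for the lightweight learned model.
        self.W = rng.standard_normal((feat_dim + tok_dim, feat_dim)) * 0.02

    def __call__(self, feats, tok_emb):
        # feats: (L, feat_dim), tok_emb: (L, tok_dim) -> (L, feat_dim)
        return np.concatenate([feats, tok_emb], axis=-1) @ self.W

def shortcut_step(feats, tok_emb, head, span):
    """Skip `span` base-model steps by extrapolating with the predicted
    average velocity: h_{t+span} ≈ h_t + span * v̄."""
    v_bar = head(feats, tok_emb)
    return feats + span * v_bar

# Toy usage: 16 tokens, 32-dim features, 8-dim sampled-token embeddings.
head = VelocityHead(feat_dim=32, tok_dim=8)
feats = np.zeros((16, 32))
tok_emb = np.ones((16, 8))
future = shortcut_step(feats, tok_emb, head, span=4)
print(future.shape)
```

Replacing, say, 3 of every 4 full bi-directional attention passes with such an extrapolation is the kind of trade-off that would yield the reported 4x-plus speedup, provided the head's approximation error stays small.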