Accelerating Masked Image Generation by Learning Latent Controlled Dynamics
February 27, 2026
Authors: Kaiwen Zhu, Quansheng Zeng, Yuandong Pu, Shuo Cao, Xiaohui Li, Yi Xin, Qi Qin, Jiayang Li, Yu Qiao, Jinjin Gu, Yihao Liu
cs.AI
Abstract
Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the many steps of bi-directional attention. In fact, there is notable redundancy in their computation: when sampling discrete tokens, the rich semantics contained in the continuous features are lost. Some existing works attempt to cache past features to approximate future ones. However, they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose to learn a lightweight model that incorporates both previous features and sampled tokens, and regresses the average velocity field of feature evolution. The model has moderate complexity: it suffices to capture the subtle dynamics while remaining lightweight compared to the original base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
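The core idea above can be sketched in a few lines: a small head consumes both the previous continuous features and the embeddings of the sampled discrete tokens, predicts the average velocity of the feature trajectory, and extrapolates across several sampler steps instead of re-running the full backbone at each one. This is a minimal numpy sketch under assumed shapes and a single linear layer standing in for the lightweight model; the function names, dimensions, and architecture are hypothetical, not the paper's actual implementation.

```python
import numpy as np

def predict_average_velocity(prev_features, token_embeddings, W, b):
    """Hypothetical lightweight velocity head: one linear layer over the
    concatenation of previous features and sampled-token embeddings."""
    x = np.concatenate([prev_features, token_embeddings], axis=-1)
    return x @ W + b  # per-token average velocity of feature evolution

def shortcut_step(prev_features, token_embeddings, W, b, num_skipped_steps):
    """Extrapolate features across `num_skipped_steps` sampler steps with
    the predicted average velocity, skipping backbone forward passes."""
    v = predict_average_velocity(prev_features, token_embeddings, W, b)
    return prev_features + num_skipped_steps * v

# Toy dimensions, purely illustrative.
rng = np.random.default_rng(0)
d_feat, d_tok, n_tokens = 16, 8, 4
W = rng.normal(scale=0.02, size=(d_feat + d_tok, d_feat))
b = np.zeros(d_feat)
feats = rng.normal(size=(n_tokens, d_feat))   # features from the last real step
toks = rng.normal(size=(n_tokens, d_tok))     # embeddings of tokens sampled there

future = shortcut_step(feats, toks, W, b, num_skipped_steps=4)
print(future.shape)  # (4, 16): approximated features four steps ahead
```

Because the head is orders of magnitude cheaper than the bi-directional-attention backbone, each skipped step trades one full forward pass for a single small matrix multiply, which is where the reported speedup would come from under this reading.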