Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Abstract

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is hampered by the multiple steps of bi-directional attention they require. Their computation is also notably redundant: when discrete tokens are sampled, the rich semantics contained in the continuous features are lost. Some existing works cache features to approximate future features, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose learning a lightweight model that incorporates both previous features and sampled tokens and regresses the average velocity field of feature evolution. The model has moderate complexity, enough to capture the subtle dynamics while remaining lightweight compared to the base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
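The idea of regressing an average velocity of feature evolution can be illustrated with a minimal sketch. This is not the released MIGM-Shortcut implementation: the function names, the toy velocity model, and the update rule (current feature plus the number of skipped steps times the predicted average velocity) are all assumptions made for illustration, and features are represented as plain lists of floats.

```python
from typing import Callable, List

Vector = List[float]


def average_velocity_target(feat_now: Vector, feat_future: Vector, num_steps: int) -> Vector:
    """Regression target: mean per-step change of the feature over num_steps.

    Assumption: "average velocity" is read as (future - current) / num_steps.
    """
    if num_steps <= 0:
        raise ValueError("num_steps must be positive")
    return [(b - a) / num_steps for a, b in zip(feat_now, feat_future)]


def shortcut_predict(
    feat_now: Vector,
    token_embedding: Vector,
    num_steps: int,
    velocity_model: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Approximate the feature num_steps ahead without running the base model.

    The velocity model sees both the current feature and the embedding of the
    sampled token, mirroring the two inputs named in the abstract.
    """
    velocity = velocity_model(feat_now, token_embedding)
    return [a + num_steps * v for a, v in zip(feat_now, velocity)]


def toy_velocity_model(feat: Vector, token_embedding: Vector) -> Vector:
    """Hypothetical stand-in for the learned lightweight model.

    It pulls the feature toward the sampled token's embedding at a fixed rate.
    """
    return [0.25 * (t - f) for f, t in zip(feat, token_embedding)]


feat_now = [0.0, 1.0]
token_embedding = [2.0, 1.0]

# Skip two base-model steps using the predicted average velocity.
feat_pred = shortcut_predict(feat_now, token_embedding, 2, toy_velocity_model)

# Training target that a real model would regress, given the true future feature.
target = average_velocity_target(feat_now, [1.0, 1.0], 2)
```

In this toy setup a pure caching baseline would reuse `feat_now` unchanged, whereas the shortcut prediction moves the feature according to the sampled token, which is the kind of sampling information the abstract says caching methods fail to account for.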
