Accelerating Masked Image Generation by Learning Latent Controlled Dynamics

Abstract

Masked Image Generation Models (MIGMs) have achieved great success, yet their efficiency is limited by the many steps of bidirectional attention they require. Their computation contains notable redundancy: when discrete tokens are sampled, the rich semantics contained in the continuous features are lost. Some existing works cache features to approximate future ones, but they exhibit considerable approximation error under aggressive acceleration rates. We attribute this to their limited expressivity and their failure to account for sampling information. To fill this gap, we propose learning a lightweight model that incorporates both previous features and sampled tokens and regresses the average velocity field of feature evolution. The model is complex enough to capture the subtle dynamics while remaining lightweight compared to the base model. We apply our method, MIGM-Shortcut, to two representative MIGM architectures and tasks. In particular, on the state-of-the-art Lumina-DiMOO, it achieves over 4x acceleration of text-to-image generation while maintaining quality, significantly pushing the Pareto frontier of masked image generation. The code and model weights are available at https://github.com/Kaiwen-Zhu/MIGM-Shortcut.
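
The idea of regressing an average velocity of feature evolution can be illustrated with a toy sketch. This is only an illustration under stated assumptions, not the released implementation: the function names (`average_velocity`, `shortcut_step`), the `velocity_model` callable, and the linear update rule are all hypothetical, and features are plain lists of floats rather than transformer activations.

```python
from typing import Callable, List

Vector = List[float]


def average_velocity(f_start: Vector, f_end: Vector, num_steps: int) -> Vector:
    """Regression target (assumed form): mean per-step change of the feature
    between two states that are num_steps sampling steps apart."""
    return [(b - a) / num_steps for a, b in zip(f_start, f_end)]


def shortcut_step(
    f_prev: Vector,
    token_embedding: Vector,
    num_steps: int,
    velocity_model: Callable[[Vector, Vector], Vector],
) -> Vector:
    """Approximate the feature num_steps ahead from the previous feature and
    the sampled-token embedding, without calling the heavy base model."""
    v = velocity_model(f_prev, token_embedding)
    return [f + num_steps * vi for f, vi in zip(f_prev, v)]


def regression_loss(predicted: Vector, target: Vector) -> float:
    """Mean squared error between predicted and target average velocity."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


if __name__ == "__main__":
    f_t = [0.0, 1.0]          # feature before the skipped steps (toy values)
    f_t_plus_2 = [2.0, 5.0]   # feature the base model would produce 2 steps later
    target = average_velocity(f_t, f_t_plus_2, num_steps=2)

    # Hypothetical "perfect" lightweight model that returns the target exactly.
    oracle = lambda feature, tokens: target
    approx = shortcut_step(f_t, [0.0, 0.0], num_steps=2, velocity_model=oracle)
    print(target, approx, regression_loss(oracle(f_t, []), target))
```

In this toy setting a model that predicts the average velocity exactly recovers the later feature in one cheap step; in practice the lightweight model would be trained to minimize a loss of this kind against features recorded from the base model.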
