

Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

September 23, 2025
作者: Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao
cs.AI

Abstract

Unified multimodal models have recently attracted considerable attention for their remarkable abilities in jointly understanding and generating diverse content. However, as contexts integrate increasingly numerous interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to simultaneously speed up both multimodal understanding and generation tasks. Our approach uses a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework delivers substantial performance gains, achieving over a 2x speedup in multimodal understanding. For generative tasks, our resulting lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing, all while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
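Two quantitative notes may help the reader. For generation, NFE counts the number of denoising function evaluations per sample, so the quoted 16.67x speedup of the 6-NFE model is consistent with a baseline of roughly 100 evaluations (100/6 ≈ 16.67); this is an inference from the figures above, not a number stated in the abstract. For understanding, the abstract names speculative decoding as the mechanism behind the >2x speedup. Below is a minimal Python sketch of the standard draft-then-verify accept/reject loop in the style of Leviathan et al.; the toy distributions are hypothetical stand-ins for the draft and target networks, not Hyper-Bagel's actual models.

import random

VOCAB_SIZE = 8

def target_probs(context):
    # Hypothetical stand-in for the full unified model's
    # next-token distribution (deterministic per context).
    rng = random.Random(hash(tuple(context)))
    w = [rng.random() + 0.1 for _ in range(VOCAB_SIZE)]
    s = sum(w)
    return [x / s for x in w]

def draft_probs(context):
    # Hypothetical cheap draft model: the target distribution
    # blurred toward uniform, so it is close but not identical.
    p = target_probs(context)
    u = 1.0 / VOCAB_SIZE
    return [0.5 * pi + 0.5 * u for pi in p]

def sample(probs):
    # Draw one token index from a discrete distribution.
    r, acc = random.random(), 0.0
    for tok, p in enumerate(probs):
        acc += p
        if r <= acc:
            return tok
    return VOCAB_SIZE - 1

def speculative_step(context, k=4):
    # Phase 1: the draft model proposes k tokens autoregressively.
    ctx = list(context)
    drafts = []
    for _ in range(k):
        tok = sample(draft_probs(ctx))
        drafts.append(tok)
        ctx.append(tok)

    # Phase 2: verify each draft token against the target model.
    # Accept with probability min(1, p/q); on the first rejection,
    # resample from the normalized residual max(0, p - q) and stop.
    accepted = []
    ctx = list(context)
    for tok in drafts:
        p, q = target_probs(ctx), draft_probs(ctx)
        if random.random() < min(1.0, p[tok] / q[tok]):
            accepted.append(tok)
            ctx.append(tok)
        else:
            residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
            z = sum(residual)
            accepted.append(sample([r / z for r in residual]) if z > 0
                            else sample(p))
            break
    else:
        # All k drafts accepted: sample one bonus token from the target.
        accepted.append(sample(target_probs(ctx)))
    return accepted

if __name__ == "__main__":
    print("accepted tokens:", speculative_step([1, 2, 3]))

This scheme samples from exactly the target distribution, which is why such acceleration can be lossless. In a real implementation the target model scores all k drafted positions in a single batched forward pass; the sequential re-evaluation above is only for clarity.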