Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
September 23, 2025
Authors: Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao
cs.AI
Abstract
Unified multimodal models have recently attracted considerable attention for
their remarkable abilities in jointly understanding and generating diverse
content. However, as contexts integrate increasingly numerous interleaved
multimodal tokens, the iterative processes of diffusion denoising and
autoregressive decoding impose significant computational overhead. To address
this, we propose Hyper-Bagel, a unified acceleration framework designed to
simultaneously speed up both multimodal understanding and generation tasks. Our
approach uses a divide-and-conquer strategy, employing speculative decoding for
next-token prediction and a multi-stage distillation process for diffusion
denoising. The framework delivers substantial performance gains, achieving over
a 2x speedup in multimodal understanding. For generative tasks, our resulting
lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a
22x speedup in image editing, all while preserving the high-quality output of
the original model. We further develop a highly efficient 1-NFE model that
enables near real-time interactive editing and generation. By combining
advanced adversarial distillation with human feedback learning, this model
achieves ultimate cost-effectiveness and responsiveness, making complex
multimodal interactions seamless and instantaneous.
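The understanding-side speedup rests on speculative decoding: a cheap draft model proposes several tokens at once, and the expensive target model verifies the whole proposal in a single pass, keeping the longest accepted prefix. The toy sketch below illustrates the greedy variant of this idea with hypothetical stand-in "models" (simple deterministic rules, not Hyper-Bagel's actual networks); it shows why the output is identical to plain target-only decoding while the number of target calls drops.

```python
def target_next(ctx):
    # Hypothetical expensive target model: deterministic toy rule.
    return sum(ctx) % 7

def draft_next(ctx):
    # Hypothetical cheap draft model: agrees with the target except when
    # the context sum is divisible by 5, simulating occasional misses.
    return (sum(ctx) + 1) % 7 if sum(ctx) % 5 == 0 else sum(ctx) % 7

def speculative_decode(ctx, n_new, k=4):
    """Generate n_new tokens; return (sequence, number of target calls)."""
    ctx = list(ctx)
    target_calls = 0
    while n_new > 0:
        # 1) Draft proposes up to k tokens autoregressively (cheap).
        proposal, tmp = [], list(ctx)
        for _ in range(min(k, n_new)):
            t = draft_next(tmp)
            proposal.append(t)
            tmp.append(t)
        # 2) Target verifies the whole proposal in one pass (counted once).
        target_calls += 1
        accepted, tmp = [], list(ctx)
        for t in proposal:
            if target_next(tmp) == t:
                accepted.append(t)
                tmp.append(t)
            else:
                # First mismatch: keep the target's own token and stop,
                # so the result matches pure greedy target decoding.
                accepted.append(target_next(tmp))
                break
        ctx.extend(accepted)
        n_new -= len(accepted)
    return ctx, target_calls

sequence, calls = speculative_decode([1, 2], n_new=12)
```

Because every accepted token is either confirmed or emitted by the target itself, the decoded sequence is exactly what greedy target-only decoding would produce, while `calls` is at most (and typically well below) the number of generated tokens; this is the lossless-speedup property the abstract relies on.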