Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation
September 23, 2025
Authors: Yanzuo Lu, Xin Xia, Manlin Zhang, Huafeng Kuang, Jianbin Zheng, Yuxi Ren, Xuefeng Xiao
cs.AI
Abstract
Unified multimodal models have recently attracted considerable attention for
their remarkable abilities in jointly understanding and generating diverse
content. However, as contexts integrate increasingly numerous interleaved
multimodal tokens, the iterative processes of diffusion denoising and
autoregressive decoding impose significant computational overhead. To address
this, we propose Hyper-Bagel, a unified acceleration framework designed to
simultaneously speed up both multimodal understanding and generation tasks. Our
approach uses a divide-and-conquer strategy, employing speculative decoding for
next-token prediction and a multi-stage distillation process for diffusion
denoising. The framework delivers substantial performance gains, achieving over
a 2x speedup in multimodal understanding. For generative tasks, our resulting
lossless 6-NFE model yields a 16.67x speedup in text-to-image generation and a
22x speedup in image editing, all while preserving the high-quality output of
the original model. We further develop a highly efficient 1-NFE model that
enables near real-time interactive editing and generation. By combining
advanced adversarial distillation with human feedback learning, this model
achieves ultimate cost-effectiveness and responsiveness, making complex
multimodal interactions seamless and instantaneous.
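The understanding-side speedup rests on speculative decoding: a cheap draft model proposes several tokens at once, and the expensive target model verifies the whole proposal in a single pass, keeping the longest accepted prefix. The toy sketch below illustrates the greedy variant of this idea with hypothetical stand-in "models" (simple deterministic rules, not Hyper-Bagel's actual networks); it shows why the output is identical to plain target-only decoding while the number of target calls drops.

```python
def target_next(ctx):
    # Hypothetical expensive target model: deterministic toy rule.
    return sum(ctx) % 7

def draft_next(ctx):
    # Hypothetical cheap draft model: agrees with the target except when
    # the context sum is divisible by 5, simulating occasional misses.
    return (sum(ctx) + 1) % 7 if sum(ctx) % 5 == 0 else sum(ctx) % 7

def speculative_decode(ctx, n_new, k=4):
    """Generate n_new tokens; return (sequence, number of target calls)."""
    ctx = list(ctx)
    target_calls = 0
    while n_new > 0:
        # 1) Draft proposes up to k tokens autoregressively (cheap).
        proposal, tmp = [], list(ctx)
        for _ in range(min(k, n_new)):
            t = draft_next(tmp)
            proposal.append(t)
            tmp.append(t)
        # 2) Target verifies the whole proposal in one pass (counted once).
        target_calls += 1
        accepted, tmp = [], list(ctx)
        for t in proposal:
            if target_next(tmp) == t:
                accepted.append(t)
                tmp.append(t)
            else:
                # First mismatch: keep the target's own token and stop,
                # so the result matches pure greedy target decoding.
                accepted.append(target_next(tmp))
                break
        ctx.extend(accepted)
        n_new -= len(accepted)
    return ctx, target_calls

sequence, calls = speculative_decode([1, 2], n_new=12)
```

Because every accepted token is either confirmed or emitted by the target itself, the decoded sequence is exactly what greedy target-only decoding would produce, while `calls` is at most (and typically well below) the number of generated tokens; this is the lossless-speedup property the abstract relies on.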