Hyper-Bagel: A Unified Acceleration Framework for Multimodal Understanding and Generation

Abstract

Unified multimodal models have recently attracted considerable attention for their ability to jointly understand and generate diverse content. However, as contexts incorporate ever more interleaved multimodal tokens, the iterative processes of diffusion denoising and autoregressive decoding impose significant computational overhead. To address this, we propose Hyper-Bagel, a unified acceleration framework designed to speed up multimodal understanding and generation tasks simultaneously. Our approach follows a divide-and-conquer strategy, employing speculative decoding for next-token prediction and a multi-stage distillation process for diffusion denoising. The framework achieves over a 2x speedup in multimodal understanding. For generative tasks, our lossless 6-NFE model (NFE: number of function evaluations) yields a 16.67x speedup in text-to-image generation and a 22x speedup in image editing while preserving the high-quality output of the original model. We further develop a highly efficient 1-NFE model that enables near real-time interactive editing and generation. By combining advanced adversarial distillation with human feedback learning, this model achieves ultimate cost-effectiveness and responsiveness, making complex multimodal interactions seamless and instantaneous.
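
To make the speculative decoding idea concrete, the following is a minimal sketch of its greedy variant. It is not the Hyper-Bagel implementation: the abstract does not describe the draft model or the acceptance rule, so the names `speculative_decode`, `toy_target`, `toy_draft` and the parameter `k` are hypothetical, and the two "models" are toy deterministic functions. A cheap draft model proposes a few tokens, the expensive target model checks them, and the longest agreeing prefix is kept.

```python
from typing import Callable, List

Token = int
# Assumed interface: a "model" maps a token sequence to its greedy next token.
NextTokenFn = Callable[[List[Token]], Token]


def greedy_decode(model: NextTokenFn, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextTokenFn, draft: NextTokenFn,
                       prompt: List[Token], n_new: int, k: int = 4) -> List[Token]:
    """Greedy speculative decoding: the draft proposes, the target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The cheap draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks every proposed position. A real system would do
        #    this in one batched forward pass; this sketch loops for clarity.
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if expected != tok:
                # First disagreement: keep the agreed prefix plus the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every proposed token matched the target's own greedy choice.
            seq.extend(proposal)
    return seq


def toy_target(seq: List[Token]) -> Token:
    """Toy stand-in for the large model (deterministic, for illustration only)."""
    return (seq[-1] * 3 + 1) % 11


def toy_draft(seq: List[Token]) -> Token:
    """Toy stand-in for the draft model: agrees with the target most of the time."""
    t = toy_target(seq)
    return (t + 1) % 11 if len(seq) % 5 == 0 else t


if __name__ == "__main__":
    print(speculative_decode(toy_target, toy_draft, [1, 2], 12))
```

Because a token is kept only when it equals the target's own greedy choice, the output is identical to plain greedy decoding with the target. Any speedup comes from verifying several drafted tokens per target pass, so it depends on how often the draft agrees with the target.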