

On Speculative Decoding for Multimodal Large Language Models

April 13, 2024
Authors: Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott
cs.AI

Abstract

Inference with Multimodal Large Language Models (MLLMs) is slow because their large-language-model backbone suffers from a memory-bandwidth bottleneck and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components in the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37x using a 115M-parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results on the other tasks.
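To make the draft-then-verify loop behind these speedups concrete, here is a minimal sketch of the greedy-acceptance variant of speculative decoding. The two model functions are toy deterministic stand-ins (their names and logic are hypothetical, not from the paper); in the paper's setting the draft would be the 115M language-only model and the target LLaVA 7B.

```python
def draft_model(prefix):
    # Toy deterministic next-token predictor (hypothetical stand-in
    # for the small 115M-parameter language-only draft model).
    return (sum(prefix) * 7 + 3) % 10

def target_model(prefix):
    # Agrees with the draft most of the time, diverges occasionally
    # (hypothetical stand-in for the LLaVA 7B target model).
    t = (sum(prefix) * 7 + 3) % 10
    return t if sum(prefix) % 3 else (t + 1) % 10

def speculative_decode(prefix, num_steps, gamma=4):
    """Greedy speculative decoding: the draft proposes gamma tokens,
    the target verifies them; at the first mismatch the target's own
    token is kept and the rest of the draft is discarded."""
    out = list(prefix)
    accepted = proposed = 0
    for _ in range(num_steps):
        # 1. Draft proposes gamma tokens auto-regressively (cheap).
        ctx = list(out)
        draft_tokens = []
        for _ in range(gamma):
            t = draft_model(ctx)
            draft_tokens.append(t)
            ctx.append(t)
        # 2. Target verifies the proposals (one batched forward
        #    pass in a real system, which is where the speedup
        #    over token-by-token target decoding comes from).
        for t in draft_tokens:
            proposed += 1
            target_t = target_model(out)
            if target_t == t:
                out.append(t)       # draft token accepted
                accepted += 1
            else:
                out.append(target_t)  # correction token from target
                break                 # discard remaining draft tokens
    return out, accepted / proposed

seq, rate = speculative_decode([1, 2], num_steps=5)
```

With greedy acceptance, the generated sequence is identical to what the target model alone would produce; the acceptance rate only determines how many target forward passes are saved.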