On Speculative Decoding for Multimodal Large Language Models
April 13, 2024
Authors: Mukul Gagrani, Raghavv Goel, Wonseok Jeon, Junyoung Park, Mingu Lee, Christopher Lott
cs.AI
Abstract
Inference with Multimodal Large Language Models (MLLMs) is slow due to their large-language-model backbone, which suffers from a memory bandwidth bottleneck and generates tokens autoregressively. In this paper, we explore the application of speculative decoding to enhance the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, bypassing the need for image tokens and their associated processing components in the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37× using a 115M-parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results in other tasks.
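
To make the draft-and-verify mechanism the abstract refers to concrete, here is a minimal sketch of the standard speculative sampling loop (Leviathan et al., 2023), not the authors' implementation. `toy_model`, `draft_model`, `target_model`, `speculative_step`, and the draft length `gamma` are illustrative stand-ins: in the paper's setting the draft would be the 115M language-only model and the target the LLaVA 7B backbone.

```python
# A minimal, self-contained sketch of speculative decoding with toy models.
# The acceptance rule min(1, p/q) and residual resampling preserve the
# target model's output distribution exactly.
import numpy as np

rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size

def toy_model(seed):
    """Build a toy 'model': a fixed table mapping last token -> next-token distribution."""
    table = np.random.default_rng(seed).dirichlet(np.ones(VOCAB), size=VOCAB)
    return lambda tokens: table[tokens[-1] % VOCAB]

draft_model = toy_model(1)   # stands in for the small language-only draft model
target_model = toy_model(2)  # stands in for the large target model (e.g. LLaVA 7B)

def speculative_step(tokens, gamma=4):
    """Draft `gamma` tokens cheaply, then verify them against the target model."""
    # 1) Draft phase: sample gamma tokens autoregressively from the draft model.
    drafted, q_dists, ctx = [], [], list(tokens)
    for _ in range(gamma):
        q = draft_model(ctx)
        t = int(rng.choice(VOCAB, p=q))
        drafted.append(t); q_dists.append(q); ctx.append(t)
    # 2) Verification phase: accept each drafted token with probability min(1, p/q).
    accepted = []
    for t, q in zip(drafted, q_dists):
        p = target_model(tokens + accepted)
        if rng.random() < min(1.0, p[t] / q[t]):
            accepted.append(t)
        else:
            # Rejected: resample from the normalized residual max(p - q, 0).
            residual = np.maximum(p - q, 0)
            accepted.append(int(rng.choice(VOCAB, p=residual / residual.sum())))
            return tokens + accepted
    # All drafts accepted: take one bonus token directly from the target model.
    p = target_model(tokens + accepted)
    accepted.append(int(rng.choice(VOCAB, p=p)))
    return tokens + accepted

tokens = [0]
for _ in range(5):
    tokens = speculative_step(tokens)
print(tokens)
```

The speedup the paper reports comes from this structure: several tokens are verified in a single memory-bound forward pass of the large model, so the expensive weights are read once per batch of drafted tokens rather than once per token.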