On Speculative Decoding for Multimodal Large Language Models

Abstract

Inference with Multimodal Large Language Models (MLLMs) is slow because their large language model backbone is bottlenecked by memory bandwidth and generates tokens autoregressively. In this paper, we explore the application of speculative decoding to improve the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, removing the need for image tokens and their associated processing components in the draft model. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37× using a 115M-parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results on other tasks.
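
For readers unfamiliar with the technique, the following is a minimal sketch of the generic speculative-sampling verification step. It is not the authors' implementation. The function name `speculative_step` and the toy list-based distributions are illustrative assumptions. In the setting described in the abstract, the draft distributions would come from the small language-only model and the target distributions from LLaVA 7B.

```python
import random


def speculative_step(draft_probs, target_probs, draft_tokens, rng=random):
    """Verify a block of drafted tokens against the target model.

    draft_probs[i] and target_probs[i] are the draft and target
    distributions (lists of floats over the vocabulary) at drafted
    position i. target_probs holds one extra entry for the position
    that follows the block. Returns the accepted tokens plus one token
    chosen by the target (a correction or a bonus token).
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        q, p = draft_probs[i], target_probs[i]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p[tok] / q[tok]):
            out.append(tok)
            continue
        # On rejection, resample from the residual max(0, p - q);
        # random.choices normalises the weights itself.
        residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
        out.append(rng.choices(range(len(p)), weights=residual)[0])
        return out
    # Every drafted token was accepted: take one bonus token from the target.
    bonus = target_probs[len(draft_tokens)]
    out.append(rng.choices(range(len(bonus)), weights=bonus)[0])
    return out
```

When the draft and target distributions agree, every drafted token is accepted and the step yields one more token than was drafted, which is where the speedup comes from: several tokens are produced for a single pass of the large model.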
