On Speculative Decoding for Multimodal Large Language Models

Abstract

Inference with Multimodal Large Language Models (MLLMs) is slow because their large language model backbone is memory-bandwidth bound and generates tokens auto-regressively. In this paper, we explore the application of speculative decoding to improve the inference efficiency of MLLMs, specifically the LLaVA 7B model. We show that a language-only model can serve as a good draft model for speculative decoding with LLaVA 7B, so the draft model needs neither image tokens nor their associated processing components. Our experiments across three different tasks show that speculative decoding can achieve a memory-bound speedup of up to 2.37× using a 115M-parameter language model that we trained from scratch. Additionally, we introduce a compact LLaVA draft model incorporating an image adapter, which shows marginal performance gains in image captioning while maintaining comparable results on the other tasks.
