Learning to Inference Adaptively for Multimodal Large Language Models

Abstract

Multimodal Large Language Models (MLLMs) have shown impressive reasoning capabilities, yet they come with substantial computational cost, which limits their deployment in resource-constrained settings. Despite recent efforts to improve the efficiency of MLLMs, prior solutions fall short in responding to varying runtime conditions, in particular changing resource availability (e.g., contention due to the execution of other programs on the device). To bridge this gap, we introduce AdaLLaVA, an adaptive inference framework that learns to dynamically reconfigure operations in an MLLM during inference, accounting for the input data and a latency budget. We conduct extensive experiments across benchmarks involving question answering, reasoning, and hallucination. Our results show that AdaLLaVA effectively adheres to the input latency budget, achieving varying accuracy and latency tradeoffs at runtime. Further, we demonstrate that AdaLLaVA adapts to both the input latency budget and the input content, can be integrated with token selection for enhanced efficiency, and generalizes across MLLMs. Our project webpage with code release is at https://zhuoyan-xu.github.io/ada-llava/.
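
To make the idea of reconfiguring operations under a latency budget concrete, the following is a minimal toy sketch. It is not the AdaLLaVA method: the abstract describes a scheduler that is learned, whereas this sketch substitutes a hand-written greedy heuristic. The names Block and select_blocks, the per-block latency costs, and the utility scores are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A hypothetical unit of computation in the model (e.g., one layer)."""
    name: str
    latency_ms: float  # assumed cost of executing this block
    utility: float     # assumed usefulness score for the current input


def select_blocks(blocks, budget_ms):
    """Greedy stand-in for a learned scheduler (an assumption, not the paper's method).

    Keeps the blocks with the best utility per millisecond until the
    latency budget is used up. Returns the kept block names in their
    original execution order, plus the total latency spent.
    """
    ranked = sorted(blocks, key=lambda b: b.utility / b.latency_ms, reverse=True)
    chosen, spent = set(), 0.0
    for block in ranked:
        if spent + block.latency_ms <= budget_ms:
            chosen.add(block.name)
            spent += block.latency_ms
    return [b.name for b in blocks if b.name in chosen], spent


# Illustrative values only; a real system would measure or predict these.
blocks = [
    Block("layer0", 10.0, 0.9),
    Block("layer1", 10.0, 0.2),
    Block("layer2", 10.0, 0.6),
    Block("layer3", 10.0, 0.4),
]

for budget in (5.0, 25.0, 40.0):
    kept, spent = select_blocks(blocks, budget)
    print(f"budget={budget} ms -> run {kept} (uses {spent} ms)")
```

In this sketch a tighter budget yields a smaller set of executed blocks, which mirrors the accuracy and latency tradeoff the abstract describes; in the sketch, the input-dependent utility scores are what would make the selection adapt to content as well as to the budget.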
