An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Abstract

In this study, we identify an inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably in prominent models such as LLaVA-1.5, QwenVL-Chat, and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, suggesting the need for a sparser approach than the one used for textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations demonstrate FastV's ability to dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance across a wide range of image and video understanding tasks. The trade-off between computational efficiency and performance in FastV is highly customizable and Pareto-efficient: it can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for deploying LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
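To make the pruning step concrete, the following is a minimal sketch and not the released implementation. The abstract does not state how visual tokens are ranked, so this sketch assumes each visual token is scored by the average attention it receives at the pruning layer; the keep ratio of 1/2 follows the title. The function name `select_visual_tokens`, the toy attention matrix, and the token layout are hypothetical and chosen only for illustration.

```python
def select_visual_tokens(attn, visual_idx, keep_ratio=0.5):
    """Return the sorted indices of tokens that survive pruning.

    attn[q][k] is the attention weight from query token q to key token k
    at the pruning layer, assumed to be already averaged over heads.
    Assumption: a visual token's importance is the mean attention it receives.
    """
    n = len(attn)
    # Mean attention received by each visual token over all query positions.
    scores = {k: sum(attn[q][k] for q in range(n)) / n for k in visual_idx}
    n_keep = max(1, int(len(visual_idx) * keep_ratio))
    ranked = sorted(visual_idx, key=lambda k: scores[k], reverse=True)
    kept_visual = set(ranked[:n_keep])
    visual = set(visual_idx)
    # Non-visual tokens are always kept; original token order is preserved.
    return [i for i in range(n) if i not in visual or i in kept_visual]


# Toy causal attention over 6 tokens: 0 and 5 are text, 1-4 are visual.
attn = [
    [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
    [0.4, 0.4, 0.2, 0.0, 0.0, 0.0],
    [0.3, 0.3, 0.1, 0.3, 0.0, 0.0],
    [0.2, 0.3, 0.05, 0.25, 0.2, 0.0],
    [0.2, 0.25, 0.05, 0.3, 0.1, 0.1],
]
kept = select_visual_tokens(attn, visual_idx=[1, 2, 3, 4], keep_ratio=0.5)
print(kept)  # [0, 1, 3, 5]
```

In this sketch, the layers after the pruning layer would process only the hidden states at the `kept` positions, which is where the FLOPs savings would come from.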
