An Image is Worth 1/2 Tokens After Layer 2: Plug-and-Play Inference Acceleration for Large Vision-Language Models

Abstract

In this study, we identify an inefficient attention phenomenon in Large Vision-Language Models (LVLMs), notably in prominent models such as LLaVA-1.5, QwenVL-Chat and Video-LLaVA. We find that attention computation over visual tokens is extremely inefficient in the deep layers of popular LVLMs, which suggests that visual tokens call for a sparser treatment than textual data. To this end, we introduce FastV, a versatile plug-and-play method designed to optimize computational efficiency by learning adaptive attention patterns in early layers and pruning visual tokens in subsequent ones. Our evaluations show that FastV can dramatically reduce computational costs (e.g., a 45% reduction in FLOPs for LLaVA-1.5-13B) without sacrificing performance across a wide range of image and video understanding tasks. The trade-off between computational efficiency and performance is highly customizable and Pareto-efficient: FastV can compress the FLOPs of a 13B-parameter model below the budget of a 7B-parameter model while still maintaining superior performance. We believe FastV has practical value for deploying LVLMs on edge devices and in commercial models. Code is released at https://github.com/pkunlp-icler/FastV.
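The pruning step can be illustrated with a small sketch. The abstract does not spell out the selection rule, so this example assumes that visual tokens are ranked by an attention score collected at the pruning layer and that only the top-scoring fraction is passed on to later layers. The function name `prune_visual_tokens`, the `attn_received` input and the default `keep_ratio` of 0.5 (suggested by the title) are all hypothetical and are not taken from the released code.

```python
def prune_visual_tokens(attn_received, visual_positions, keep_ratio=0.5):
    """Return the sequence positions to keep after the pruning layer.

    attn_received    -- one score per sequence position (assumed here to be
                        the attention each token receives at the pruning layer)
    visual_positions -- indices of the image tokens within the sequence
    keep_ratio       -- fraction of image tokens to keep (hypothetical default)
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be between 0 and 1")

    visual = list(visual_positions)
    n_keep = int(len(visual) * keep_ratio)

    # Rank image tokens by score, highest first, and keep the top n_keep.
    ranked = sorted(visual, key=lambda i: attn_received[i], reverse=True)
    kept_visual = set(ranked[:n_keep])
    visual_set = set(visual)

    # Text tokens are always kept; the original token order is preserved.
    return [
        i for i in range(len(attn_received))
        if i not in visual_set or i in kept_visual
    ]


# Toy sequence: positions 1-4 are image tokens, 0 and 5 are text tokens.
scores = [0.30, 0.05, 0.20, 0.01, 0.10, 0.34]
kept = prune_visual_tokens(scores, visual_positions=[1, 2, 3, 4])
print(kept)  # [0, 2, 4, 5]
```

In a real model the kept positions would be used to slice the hidden states before the remaining layers run, so every later layer processes a shorter sequence. That shorter sequence is where the FLOPs reduction would come from under this assumption.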
