PyramidDrop: Accelerating Your Large Vision-Language Models via Pyramid Visual Redundancy Reduction

Abstract

In large vision-language models (LVLMs), images serve as inputs that carry a wealth of information. As the idiom "A picture is worth a thousand words" suggests, representing a single image in current LVLMs can require hundreds or even thousands of tokens. This results in significant computational costs, which grow quadratically as the input image resolution increases, severely impacting the efficiency of both training and inference. Previous approaches have attempted to reduce the number of image tokens either before or within the early layers of LVLMs. However, these strategies inevitably lose crucial image information, ultimately degrading model performance. To address this challenge, we conduct an empirical study revealing that all visual tokens are necessary for LVLMs in the shallow layers, and that token redundancy progressively increases in the deeper layers of the model. Based on this finding, we propose PyramidDrop, a visual redundancy reduction strategy that boosts the efficiency of LVLMs in both training and inference with negligible performance loss. Specifically, we partition the LVLM into several stages and drop part of the image tokens at the end of each stage with a pre-defined ratio, creating pyramid-like visual tokens across model layers. The dropping is based on a lightweight similarity calculation with negligible time overhead. Extensive experiments demonstrate that PyramidDrop can reduce the training time of LLaVA-NeXT by 40% and its inference FLOPs by 55% with comparable performance. In addition, PyramidDrop can serve as a plug-and-play strategy for inference acceleration without training, with better performance and lower inference cost than its counterparts. We hope that the insights and approach introduced by PyramidDrop will inspire future research to further investigate the role of image tokens in LVLMs.
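The stage-wise dropping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the four-stage split, the keep ratio of 0.5, and the use of a plain dot product against a single query vector as the "lightweight similarity" are all assumptions made for illustration, since the abstract does not specify them.

```python
# Hypothetical sketch of pyramid-style image-token dropping.
# Assumptions (not from the abstract): dot-product similarity against one
# query vector, a fixed keep ratio, and tokens stored as plain Python lists.

def pyramid_token_counts(num_image_tokens, num_stages, keep_ratio):
    """Return the number of image tokens entering each stage when a fixed
    fraction of tokens is kept at the end of every stage."""
    counts = []
    n = num_image_tokens
    for _ in range(num_stages):
        counts.append(n)
        n = max(1, int(n * keep_ratio))  # tokens surviving into the next stage
    return counts


def drop_tokens(image_tokens, query, keep_ratio):
    """Keep the image tokens most similar to `query`, in their original order."""
    num_keep = max(1, int(len(image_tokens) * keep_ratio))
    # Similarity score: dot product between each image token and the query.
    scores = [sum(t * q for t, q in zip(tok, query)) for tok in image_tokens]
    # Indices of the highest-scoring tokens.
    top = sorted(range(len(image_tokens)), key=lambda i: scores[i], reverse=True)[:num_keep]
    # Restore positional order so the kept tokens stay in sequence.
    return [image_tokens[i] for i in sorted(top)]


if __name__ == "__main__":
    # 576 image tokens, 4 stages, half of the tokens kept after each stage.
    print(pyramid_token_counts(576, 4, 0.5))

    tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0], [0.0, 3.0]]
    print(drop_tokens(tokens, query=[1.0, 0.0], keep_ratio=0.5))
```

Under these assumptions the token count shrinks geometrically from stage to stage, which is what gives the retained visual tokens their pyramid shape: early stages see every image token, while deeper stages process only the highest-scoring ones.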
