Efficient Inference of Vision Instruction-Following Models with Elastic Cache

Abstract

Efficient deployment of instruction-following large vision-language models (LVLMs) is hampered by the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. To close this gap, we introduce Elastic Cache, a novel approach that applies distinct acceleration methods to the instruction encoding and output generation stages. We investigate importance metrics for the different stages and propose an importance-driven cache merging strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, which improves the preservation of contextual information in the KV caches while allowing an arbitrary acceleration ratio. For instruction encoding, we use frequency to evaluate the importance of caches. For output generation, we prioritize tokens based on their distance with an offset, so that both the initial and the most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at https://github.com/liuzuyan/ElasticCache.
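The anchor-and-merge idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal illustration under stated assumptions. The function names `frequency_importance`, `recency_importance`, and `merge_kv_cache`, and the parameters `keep_ratio` and `offset`, are hypothetical. The sketch further assumes that "frequency" can be approximated by the total attention weight each cached token receives, that each non-anchor entry is assigned to its nearest anchor by sequence position, and that merging means averaging.

```python
import numpy as np


def frequency_importance(attn):
    """Assumed proxy for 'frequency': total attention each key receives.

    attn has shape (num_queries, num_keys).
    """
    return attn.sum(axis=0)


def recency_importance(n, offset):
    """Assumed 'distance with an offset' score for the generation stage.

    The first `offset` tokens get the top score; all other tokens are
    scored by position, so more recent tokens rank higher.
    """
    scores = np.arange(n, dtype=float)
    scores[:offset] = float(n)
    return scores


def merge_kv_cache(keys, values, importance, keep_ratio):
    """Shrink an (n, d) KV cache to about keep_ratio * n merged entries.

    The most important positions become anchors; every other entry is
    averaged into its nearest anchor instead of being discarded.
    """
    n = keys.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    # Anchors: the k highest-scoring positions, kept in sequence order.
    anchors = np.sort(np.argsort(-importance, kind="stable")[:k])
    # Assign each position to its nearest anchor (ties go to the earlier one).
    positions = np.arange(n)
    nearest = np.abs(positions[:, None] - anchors[None, :]).argmin(axis=1)
    counts = np.bincount(nearest, minlength=k)
    merged_k = np.zeros((k, keys.shape[1]))
    merged_v = np.zeros((k, values.shape[1]))
    # Accumulate every entry into its anchor's slot, then average.
    np.add.at(merged_k, nearest, keys)
    np.add.at(merged_v, nearest, values)
    return merged_k / counts[:, None], merged_v / counts[:, None], anchors
```

Because `keep_ratio` is a free parameter, the cache can be shrunk to any target size, which mirrors the arbitrary acceleration ratio mentioned in the abstract. In this sketch, `frequency_importance` would supply the scores during instruction encoding and `recency_importance` during output generation.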
