
Efficient Inference of Vision Instruction-Following Models with Elastic Cache

July 25, 2024
作者: Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu
cs.AI

Abstract

In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper we introduce Elastic Cache, a novel approach that benefits from applying distinct acceleration methods to the instruction encoding and output generation stages. We investigate importance metrics for the different stages and propose an importance-driven cache merging strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, enhancing the preservation of contextual information in the KV cache while yielding an arbitrary acceleration ratio. For instruction encoding, we use frequency to evaluate the importance of caches. For output generation, we prioritize tokens based on their distance from an offset, so that both the initial and most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at https://github.com/liuzuyan/ElasticCache.
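
The abstract describes two ideas that lend themselves to a small illustration: important key/value vectors are kept as anchor points, and the surrounding, less important cache entries are merged into those anchors rather than discarded. The snippet below is a minimal sketch of that merging step only; the tensor shapes, the nearest-anchor assignment, the averaging rule, and the names `merge_kv_cache` and `importance` are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
# Illustrative sketch of importance-driven KV cache merging with anchor points.
# Assumptions: per-head caches of shape (seq_len, head_dim), a precomputed
# per-token importance score, and a simple "merge into nearest anchor by
# position, then average" rule.
import torch


def merge_kv_cache(keys, values, importance, keep_ratio=0.5):
    """keys, values: (seq_len, head_dim); importance: (seq_len,).

    Returns compressed keys/values with ceil(seq_len * keep_ratio) entries.
    """
    seq_len = keys.size(0)
    num_keep = max(1, int(seq_len * keep_ratio))

    # Keep the highest-importance positions as anchors, in original order.
    anchor_idx = importance.topk(num_keep).indices.sort().values

    # Assign every token to its nearest anchor by sequence position.
    positions = torch.arange(seq_len)
    dist = (positions.unsqueeze(1) - anchor_idx.unsqueeze(0)).abs()
    assignment = dist.argmin(dim=1)  # (seq_len,) index into the anchors

    # Average the keys/values of all tokens assigned to each anchor,
    # so discarded positions still contribute context instead of vanishing.
    merged_keys = torch.zeros(num_keep, keys.size(1))
    merged_values = torch.zeros(num_keep, values.size(1))
    counts = torch.zeros(num_keep, 1)
    merged_keys.index_add_(0, assignment, keys)
    merged_values.index_add_(0, assignment, values)
    counts.index_add_(0, assignment, torch.ones(seq_len, 1))
    return merged_keys / counts, merged_values / counts


if __name__ == "__main__":
    torch.manual_seed(0)
    k, v = torch.randn(16, 8), torch.randn(16, 8)
    score = torch.rand(16)  # stand-in for an importance score (e.g. frequency-based)
    ck, cv = merge_kv_cache(k, v, score, keep_ratio=0.25)
    print(ck.shape, cv.shape)  # torch.Size([4, 8]) torch.Size([4, 8])
```

In this toy version the importance score is random; following the abstract, one would derive it from attention frequency during instruction encoding and from distance to an offset during output generation, so that initial and most recent tokens are preferentially retained.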

