Efficient Inference of Vision Instruction-Following Models with Elastic Cache
July 25, 2024
Authors: Zuyan Liu, Benlin Liu, Jiahui Wang, Yuhao Dong, Guangyi Chen, Yongming Rao, Ranjay Krishna, Jiwen Lu
cs.AI
Abstract
In the field of instruction-following large vision-language models (LVLMs),
the efficient deployment of these models faces challenges, notably due to the
high memory demands of their key-value (KV) caches. Conventional cache
management strategies for LLMs focus on cache eviction, which often fails to
address the specific needs of multimodal instruction-following models.
Recognizing this gap, in this paper, we introduce Elastic Cache, a novel
approach that benefits from applying distinct acceleration methods for
instruction encoding and output generation stages. We investigate the metrics
of importance in different stages and propose an importance-driven cache
merging strategy to prune redundant caches. Instead of discarding less
important caches, our strategy identifies important key/value vectors as anchor
points. Surrounding less important caches are then merged with these anchors,
enhancing the preservation of contextual information in the KV caches while
yielding an arbitrary acceleration ratio. For instruction encoding, we use
the frequency with which a cache is attended to evaluate its importance. Regarding output
generation, we prioritize tokens based on their distance from an offset, so
that both the initial and the most recent tokens are retained. Results on a range
of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also
notably outperforms existing pruning methods in language generation across
various tasks. Code is available at https://github.com/liuzuyan/ElasticCache.
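
To make the two-stage policy concrete, below is a minimal, self-contained PyTorch sketch of the ideas the abstract describes: selecting the most important cached key/value vectors as anchors and merging their less important neighbors into them during instruction encoding, and keeping only the initial and most recent tokens during output generation. The function names, the nearest-anchor averaging rule, and all hyperparameters (keep_ratio, budget, offset) are illustrative assumptions, not the authors' implementation; see https://github.com/liuzuyan/ElasticCache for the official code.

```python
# Illustrative sketch only: a simplified, single-head version of an
# importance-driven KV-cache merging policy and a distance-with-offset
# retention rule. Not the Elastic Cache reference implementation.

import torch


def merge_kv_by_importance(keys, values, attn_weights, keep_ratio=0.5):
    """Instruction-encoding stage (sketch): keep the most-attended cache
    entries as anchors and fold every remaining entry into its nearest anchor.

    keys, values: [seq_len, head_dim] cached key/value vectors for one head.
    attn_weights: [seq_len] accumulated attention each cached token received,
                  used here as a frequency-style importance score (assumption).
    keep_ratio:   fraction of cache entries retained as anchors.
    """
    seq_len = keys.size(0)
    num_anchors = max(1, int(seq_len * keep_ratio))

    # Importance = how strongly/often each cached token is attended to.
    anchor_idx = torch.topk(attn_weights, num_anchors).indices.sort().values

    # Assign every position to its nearest anchor (by position), then average
    # the key/value vectors that fall into each anchor's group.
    positions = torch.arange(seq_len)
    dist = (positions.unsqueeze(1) - anchor_idx.unsqueeze(0)).abs()
    assignment = dist.argmin(dim=1)  # [seq_len] -> anchor slot

    merged_keys = torch.zeros(num_anchors, keys.size(1))
    merged_values = torch.zeros(num_anchors, values.size(1))
    counts = torch.zeros(num_anchors, 1)
    merged_keys.index_add_(0, assignment, keys)
    merged_values.index_add_(0, assignment, values)
    counts.index_add_(0, assignment, torch.ones(seq_len, 1))
    return merged_keys / counts, merged_values / counts


def retained_positions_for_generation(cache_len, budget, offset=4):
    """Output-generation stage (sketch): keep the first `offset` (initial)
    tokens and the most recent tokens so the cache stays within `budget`."""
    if cache_len <= budget:
        return torch.arange(cache_len)
    head = torch.arange(min(offset, budget))
    tail = torch.arange(cache_len - (budget - head.numel()), cache_len)
    return torch.cat([head, tail])


if __name__ == "__main__":
    torch.manual_seed(0)
    k, v = torch.randn(16, 8), torch.randn(16, 8)
    scores = torch.rand(16)
    mk, mv = merge_kv_by_importance(k, v, scores, keep_ratio=0.25)
    print(mk.shape, mv.shape)  # torch.Size([4, 8]) for both merged tensors
    print(retained_positions_for_generation(16, 8, offset=2))
```

In this sketch, merging (averaging neighbors into anchors) rather than evicting is what preserves some contextual information from the dropped positions while still shrinking the cache to an arbitrary target ratio, which is the distinction the abstract draws against eviction-only strategies.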