Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
May 26, 2025
Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
cs.AI
Abstract
Visual Autoregressive (VAR) modeling has garnered significant attention for
its innovative next-scale prediction approach, which yields substantial
improvements in efficiency, scalability, and zero-shot generalization.
Nevertheless, the coarse-to-fine methodology inherent in VAR results in
exponential growth of the KV cache during inference, causing considerable
memory consumption and computational redundancy. To address these bottlenecks,
we introduce ScaleKV, a novel KV cache compression framework tailored for VAR
architectures. ScaleKV leverages two critical observations: varying cache
demands across transformer layers and distinct attention patterns at different
scales. Based on these insights, ScaleKV categorizes transformer layers into
two functional groups: drafters and refiners. Drafters exhibit dispersed
attention across multiple scales, thereby requiring greater cache capacity.
Conversely, refiners focus attention on the current token map to process local
details, consequently necessitating substantially reduced cache capacity.
ScaleKV optimizes the multi-scale inference pipeline by identifying
scale-specific drafters and refiners, facilitating differentiated cache
management tailored to each scale. Evaluation on the state-of-the-art
text-to-image VAR model family, Infinity, demonstrates that our approach
effectively reduces the required KV cache memory to 10% while preserving
pixel-level fidelity.
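The drafter/refiner split described above implies a simple budget-allocation step: at each scale, layers whose attention disperses over earlier scales receive a larger slice of the compressed cache, while layers attending mostly to the current token map keep only a small residue. The sketch below illustrates this idea; the function name, the dispersion threshold, and the 4:1 drafter-to-refiner weighting are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of ScaleKV-style per-layer cache budgeting.
# The threshold and the 4:1 drafter/refiner weighting are assumptions
# for illustration; the paper only states that drafters need more cache.

def allocate_cache_budgets(dispersion, full_cache, ratio=0.10, threshold=0.5):
    """Split a total KV-cache budget across layers at one scale.

    dispersion: per-layer attention dispersion in [0, 1]; high values mean
        attention spreads across earlier scales (drafter-like), low values
        mean attention stays on the current token map (refiner-like).
    full_cache: KV entries per layer without compression.
    ratio: fraction of the full cache kept overall (the paper reports ~10%).
    """
    n_layers = len(dispersion)
    total_budget = int(n_layers * full_cache * ratio)
    drafters = [i for i, d in enumerate(dispersion) if d >= threshold]
    refiners = [i for i, d in enumerate(dispersion) if d < threshold]
    # Drafters get a larger per-layer share; refiners keep a small residue.
    weights = {i: 4.0 for i in drafters}
    weights.update({i: 1.0 for i in refiners})
    total_w = sum(weights.values())
    budgets = {i: int(total_budget * w / total_w) for i, w in weights.items()}
    return budgets, drafters, refiners
```

For example, with four layers whose dispersion scores are [0.9, 0.8, 0.2, 0.1] and a full per-layer cache of 1000 entries, the total budget is 400 entries (10%), split 160/160 to the two drafter layers and 40/40 to the two refiner layers. In the actual method this classification is scale-specific, so the same layer may be budgeted differently at different scales.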