Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression
May 26, 2025
Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
cs.AI
Abstract
Visual Autoregressive (VAR) modeling has garnered significant attention for
its innovative next-scale prediction approach, which yields substantial
improvements in efficiency, scalability, and zero-shot generalization.
Nevertheless, the coarse-to-fine methodology inherent in VAR results in
exponential growth of the KV cache during inference, causing considerable
memory consumption and computational redundancy. To address these bottlenecks,
we introduce ScaleKV, a novel KV cache compression framework tailored for VAR
architectures. ScaleKV leverages two critical observations: varying cache
demands across transformer layers and distinct attention patterns at different
scales. Based on these insights, ScaleKV categorizes transformer layers into
two functional groups: drafters and refiners. Drafters exhibit dispersed
attention across multiple scales, thereby requiring greater cache capacity.
Conversely, refiners focus attention on the current token map to process local
details, consequently necessitating substantially reduced cache capacity.
ScaleKV optimizes the multi-scale inference pipeline by identifying
scale-specific drafters and refiners, facilitating differentiated cache
management tailored to each scale. Evaluation on the state-of-the-art
text-to-image VAR model family, Infinity, demonstrates that our approach
effectively reduces the required KV cache memory to 10% while preserving
pixel-level fidelity.
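The drafter/refiner split described above implies a simple budget-allocation step: at each scale, layers whose attention disperses over earlier scales receive a larger slice of the compressed cache, while layers attending mostly to the current token map keep only a small residue. The sketch below illustrates this idea; the function name, the dispersion threshold, and the 4:1 drafter-to-refiner weighting are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of ScaleKV-style per-layer cache budgeting.
# The threshold and the 4:1 drafter/refiner weighting are assumptions
# for illustration; the paper only states that drafters need more cache.

def allocate_cache_budgets(dispersion, full_cache, ratio=0.10, threshold=0.5):
    """Split a total KV-cache budget across layers at one scale.

    dispersion: per-layer attention dispersion in [0, 1]; high values mean
        attention spreads across earlier scales (drafter-like), low values
        mean attention stays on the current token map (refiner-like).
    full_cache: KV entries per layer without compression.
    ratio: fraction of the full cache kept overall (the paper reports ~10%).
    """
    n_layers = len(dispersion)
    total_budget = int(n_layers * full_cache * ratio)
    drafters = [i for i, d in enumerate(dispersion) if d >= threshold]
    refiners = [i for i, d in enumerate(dispersion) if d < threshold]
    # Drafters get a larger per-layer share; refiners keep a small residue.
    weights = {i: 4.0 for i in drafters}
    weights.update({i: 1.0 for i in refiners})
    total_w = sum(weights.values())
    budgets = {i: int(total_budget * w / total_w) for i, w in weights.items()}
    return budgets, drafters, refiners
```

For example, with four layers whose dispersion scores are [0.9, 0.8, 0.2, 0.1] and a full per-layer cache of 1000 entries, the total budget is 400 entries (10%), split 160/160 to the two drafter layers and 40/40 to the two refiner layers. In the actual method this classification is scale-specific, so the same layer may be budgeted differently at different scales.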