

Memory-Efficient Visual Autoregressive Modeling with Scale-Aware KV Cache Compression

May 26, 2025
Authors: Kunjun Li, Zigeng Chen, Cheng-Yen Yang, Jenq-Neng Hwang
cs.AI

Abstract

Visual Autoregressive (VAR) modeling has garnered significant attention for its innovative next-scale prediction approach, which yields substantial improvements in efficiency, scalability, and zero-shot generalization. Nevertheless, the coarse-to-fine methodology inherent in VAR results in exponential growth of the KV cache during inference, causing considerable memory consumption and computational redundancy. To address these bottlenecks, we introduce ScaleKV, a novel KV cache compression framework tailored for VAR architectures. ScaleKV leverages two critical observations: varying cache demands across transformer layers and distinct attention patterns at different scales. Based on these insights, ScaleKV categorizes transformer layers into two functional groups: drafters and refiners. Drafters exhibit dispersed attention across multiple scales, thereby requiring greater cache capacity. Conversely, refiners focus attention on the current token map to process local details, consequently necessitating substantially reduced cache capacity. ScaleKV optimizes the multi-scale inference pipeline by identifying scale-specific drafters and refiners, facilitating differentiated cache management tailored to each scale. Evaluation on the state-of-the-art text-to-image VAR model family, Infinity, demonstrates that our approach effectively reduces the required KV cache memory to 10% while preserving pixel-level fidelity.
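The drafter/refiner idea above can be illustrated with a minimal sketch. The function names, the attention-spread heuristic, and the 80/20 budget split below are illustrative assumptions for exposition, not the paper's exact algorithm: layers whose attention is dispersed across scales are treated as drafters and given most of the KV cache budget, while refiners keep only a small share.

```python
# Hypothetical sketch of ScaleKV-style differentiated cache budgeting.
# The classification heuristic and budget shares are assumptions,
# not the paper's published procedure.

def classify_layers(attention_spread, threshold=0.5):
    """Label each transformer layer as a 'drafter' (attention dispersed
    across multiple scales) or a 'refiner' (attention focused on the
    current token map), given a per-layer spread score in [0, 1]."""
    return ["drafter" if s >= threshold else "refiner" for s in attention_spread]

def allocate_cache_budget(roles, total_budget, drafter_share=0.8):
    """Split a total KV-cache token budget so drafters receive the bulk
    of the capacity and refiners keep only a small local window."""
    n_drafters = sum(r == "drafter" for r in roles) or 1
    n_refiners = sum(r == "refiner" for r in roles) or 1
    drafter_total = int(total_budget * drafter_share)  # pooled drafter budget
    refiner_total = total_budget - drafter_total       # remainder for refiners
    return [
        drafter_total // n_drafters if r == "drafter" else refiner_total // n_refiners
        for r in roles
    ]

# Example: four layers with measured attention-spread scores at one scale.
roles = classify_layers([0.9, 0.2, 0.7, 0.1])
budgets = allocate_cache_budget(roles, total_budget=1000)
```

In this toy run the two drafter layers each receive 400 cached tokens and the two refiner layers 100 each; in ScaleKV the classification is additionally scale-specific, so the same layer may receive different budgets at different scales of the coarse-to-fine pipeline.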

