Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Abstract

The deployment of Large Language Models is constrained by the memory and bandwidth demands of static weights and the dynamic Key-Value (KV) cache. SVD-based compression offers a hardware-friendly way to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimality, practical efficiency, and numerical stability. Swift-SVD incrementally aggregates the covariance of output activations over a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We use effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets show that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering a 3-70x speedup in end-to-end compression time. Our code will be released upon acceptance.
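The core step described above (accumulate the output-activation covariance, then run one eigenvalue decomposition) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation, which has not been released. The function names, the `y = x @ W.T` layer convention, the uncentered covariance, and the entropy-based definition of effective rank are all assumptions made for illustration, and the dynamic rank allocation strategy is not covered.

```python
import numpy as np


def accumulate_output_covariance(weight, input_batches):
    """Sum Y^T Y over batches, where Y = X @ weight.T (assumed layer convention)."""
    d_out = weight.shape[0]
    cov = np.zeros((d_out, d_out), dtype=np.float64)
    for x in input_batches:
        y = x @ weight.T          # output activations for this batch
        cov += y.T @ y            # incremental, uncentered covariance
    return cov


def low_rank_factors(weight, cov, rank):
    """One symmetric eigendecomposition, then project the weight onto the top-`rank` eigenvectors."""
    eigvals, eigvecs = np.linalg.eigh(cov)      # eigenvalues in ascending order
    eigvals, eigvecs = eigvals[::-1], eigvecs[:, ::-1]
    a = eigvecs[:, :rank]                       # (d_out, rank)
    b = a.T @ weight                            # (rank, d_in)
    return a, b, eigvals


def effective_rank(eigvals):
    """Assumed definition: exp of the entropy of the normalized singular values."""
    s = np.sqrt(np.clip(eigvals, 0.0, None))    # singular values of the stacked outputs
    p = s / s.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))


# Hypothetical toy layer and calibration data.
rng = np.random.default_rng(0)
W = rng.standard_normal((16, 32))
batches = [rng.standard_normal((64, 32)) for _ in range(4)]

cov = accumulate_output_covariance(W, batches)
A, B, eigvals = low_rank_factors(W, cov, rank=4)

# Output reconstruction error of the rank-4 layer on the calibration inputs.
X = np.vstack(batches)
Y = X @ W.T
err = np.linalg.norm(Y - X @ (A @ B).T)
erank = effective_rank(eigvals)
```

Under these assumptions, `A @ B` is the rank-4 weight whose outputs on the calibration inputs are closest to the original outputs in Frobenius norm. This follows from the Eckart-Young theorem, because projecting `Y` onto the top eigenvectors of `Y^T Y` is the same as truncating the SVD of `Y`.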
