Swift-SVD: Theoretical Optimality Meets Practical Efficiency in Low-Rank LLM Compression

Abstract

The deployment of Large Language Models (LLMs) is constrained by the memory and bandwidth demands of static weights and the dynamic Key-Value (KV) cache. SVD-based compression provides a hardware-friendly way to reduce these costs. However, existing methods suffer from two key limitations: some are suboptimal in reconstruction error, while others are theoretically optimal but practically inefficient. In this paper, we propose Swift-SVD, an activation-aware, closed-form compression framework that simultaneously guarantees theoretical optimality, practical efficiency, and numerical stability. Swift-SVD incrementally aggregates the covariance of output activations over a batch of inputs and performs a single eigenvalue decomposition after aggregation, enabling training-free, fast, and optimal layer-wise low-rank approximation. We employ effective rank to analyze local layer-wise compressibility and design a dynamic rank allocation strategy that jointly accounts for local reconstruction loss and end-to-end layer importance. Extensive experiments across six LLMs and eight datasets demonstrate that Swift-SVD outperforms state-of-the-art baselines, achieving optimal compression accuracy while delivering 3x to 70x speedups in end-to-end compression time. Our code will be released upon acceptance.
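The aggregation step described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (their code is not yet released); it only shows the general idea under stated assumptions: a single linear layer computing `Y = X @ W`, an uncentered covariance (Gram matrix) `Y.T @ Y` accumulated batch by batch, and hypothetical helper names `aggregate_output_covariance` and `low_rank_factors`. Projecting `W` onto the top eigenvectors of that matrix gives the rank-r factorization that minimizes the output reconstruction error on the calibration inputs, by the Eckart-Young theorem.

```python
import numpy as np


def aggregate_output_covariance(weight, input_batches):
    """Accumulate the uncentered covariance Y^T Y of layer outputs, batch by batch."""
    d_out = weight.shape[1]
    cov = np.zeros((d_out, d_out), dtype=np.float64)
    for x in input_batches:
        y = x @ weight      # output activations for this batch
        cov += y.T @ y      # running sum; only a d_out x d_out matrix is stored
    return cov


def low_rank_factors(weight, cov, rank):
    """Return (A, B) with A @ B approximating `weight` at the given rank."""
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    _, eigvecs = np.linalg.eigh(cov)
    top = eigvecs[:, ::-1][:, :rank]   # eigenvectors of the `rank` largest eigenvalues
    a = weight @ top                   # shape (d_in, rank)
    b = top.T                          # shape (rank, d_out)
    return a, b


# Illustrative data only: random weights and random calibration inputs.
rng = np.random.default_rng(0)
weight = rng.standard_normal((16, 12))
batches = [rng.standard_normal((32, 16)) for _ in range(4)]
rank = 4

cov = aggregate_output_covariance(weight, batches)
a, b = low_rank_factors(weight, cov, rank)

# Compare output reconstruction error against plain truncated SVD of the weights.
x_all = np.vstack(batches)
y_all = x_all @ weight
err_aware = np.linalg.norm(y_all - x_all @ a @ b)

u, s, vt = np.linalg.svd(weight, full_matrices=False)
w_plain = (u[:, :rank] * s[:rank]) @ vt[:rank]
err_plain = np.linalg.norm(y_all - x_all @ w_plain)

print(f"activation-aware error: {err_aware:.4f}")
print(f"plain truncated-SVD error: {err_plain:.4f}")
```

Because only the `d_out x d_out` accumulator is kept in memory, the calibration data can be streamed, and the decomposition is performed once after all batches have been seen. The activation-aware error can never exceed the plain truncated-SVD error on the same inputs, since the former is optimal among all rank-r replacements of `W` for those inputs.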
