OScaR: Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant bottleneck for efficient deployment. While established per-channel quantization effectively accommodates the intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters must span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced variance along the sequence dimension both effectively and efficiently, supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves up to a 3.0x decoding speedup, a 5.3x reduction in memory footprint, and a 4.1x increase in throughput. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.
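
To make the TNI effect concrete, the following toy sketch quantizes a synthetic Key group to 2 bits with per-channel min-max parameters shared across all tokens, once directly and once after rescaling every token to unit norm. It is an illustration under stated assumptions, not the OScaR algorithm: the Gaussian data, the single high-norm token, the helper names (`quantize_per_channel`, `quantize_with_token_scaling`, `make_keys`), and the plain L2 normalization are all hypothetical choices, and the rotation step is omitted entirely.

```python
import math
import random


def norm(v):
    """L2 norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def quantize_per_channel(rows, bits=2):
    """Asymmetric min-max quantization with one (min, step) pair per channel,
    shared by every token (row) in the group. Returns dequantized rows."""
    levels = (1 << bits) - 1
    n_ch = len(rows[0])
    out = [[0.0] * n_ch for _ in rows]
    for c in range(n_ch):
        col = [r[c] for r in rows]
        lo, hi = min(col), max(col)
        step = (hi - lo) / levels or 1.0  # guard against a constant channel
        for t, x in enumerate(col):
            q = round((x - lo) / step)  # integer code in [0, levels]
            out[t][c] = lo + q * step
    return out


def quantize_with_token_scaling(rows, bits=2):
    """Assumed stand-in for token scaling: normalize each token to unit norm,
    quantize per channel, then restore the norms kept in full precision."""
    norms = [norm(r) for r in rows]
    unit = [[x / n for x in r] for r, n in zip(rows, norms)]
    deq = quantize_per_channel(unit, bits)
    return [[x * n for x in r] for r, n in zip(deq, norms)]


def mean_relative_error(ref, approx):
    """Average over tokens of ||k - k_hat|| / ||k||."""
    errs = []
    for r, a in zip(ref, approx):
        diff = [x - y for x, y in zip(r, a)]
        errs.append(norm(diff) / norm(r))
    return sum(errs) / len(errs)


def make_keys(n_tokens=32, n_channels=64, outlier_gain=50.0, seed=0):
    """Synthetic Key group: i.i.d. Gaussian tokens, with token 0 scaled up
    to create a large norm disparity inside the group."""
    rng = random.Random(seed)
    keys = [[rng.gauss(0.0, 1.0) for _ in range(n_channels)]
            for _ in range(n_tokens)]
    keys[0] = [outlier_gain * x for x in keys[0]]
    return keys


keys = make_keys()
err_plain = mean_relative_error(keys, quantize_per_channel(keys))
err_scaled = mean_relative_error(keys, quantize_with_token_scaling(keys))
print(f"per-channel only:        {err_plain:.3f}")
print(f"with per-token scaling:  {err_scaled:.3f}")
```

In this setup the high-norm token stretches every channel's quantization range, so the low-norm tokens collapse onto one or two levels and their relative error grows. Equalizing the norms first lets the shared parameters cover all tokens at a comparable resolution, at the cost of storing one extra scalar per token.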