RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference

Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
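
To make the notions of "effective bits" and a "global bit budget" concrete, the following is a minimal sketch, not the authors' implementation. It assumes that effective bits means the parameter-weighted average of per-layer bit widths, which is a common convention but is not spelled out in the abstract. The layer names, parameter counts, and the helpers `LayerAllocation`, `effective_bits`, and `within_budget` are all hypothetical, and the sketch ignores storage overhead such as quantization scales, so it is not expected to reproduce the reported file sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayerAllocation:
    """One layer's share of a mixed-precision allocation (hypothetical type)."""
    name: str        # illustrative layer label, not taken from the paper
    num_params: int  # number of weights in the layer
    bits: int        # bit width assigned to this layer


def effective_bits(allocations):
    """Parameter-weighted average bit width (assumed definition)."""
    total_params = sum(a.num_params for a in allocations)
    total_bits = sum(a.num_params * a.bits for a in allocations)
    return total_bits / total_params


def within_budget(allocations, budget_bits):
    """True if the allocation's average bit width does not exceed the budget."""
    return effective_bits(allocations) <= budget_bits


# Toy allocation: one projection is pushed to 3 bits, the others stay at 4.
allocations = [
    LayerAllocation("attn.q_proj", 2_000_000, 4),
    LayerAllocation("attn.o_proj", 2_000_000, 3),
    LayerAllocation("mlp.up_proj", 4_000_000, 4),
]

avg_bits = effective_bits(allocations)
print(f"effective bits: {avg_bits:.2f}")
print(f"fits a 4.0-bit budget: {within_budget(allocations, 4.0)}")
```

Under this assumed definition, lowering the bit width of a large layer moves the average much more than lowering a small one, which is one reason a learned per-layer allocation can land below a uniform 4-bit footprint while keeping sensitive layers at higher precision.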
