RAMP: 効率的なオンデバイスLLM推論のための強化学習適応混合精度量子化

要旨

ポストトレーニング量子化は、リソース制約のあるハードウェア上で大規模言語モデル（LLM）を展開するために不可欠である。しかし、従来の最先端手法では層全体に均一なビット幅を適用するため、精度と効率性のトレードオフが最適化されていない。本論文では、RAMP（強化学習適応混合精度）を提案する。これは、オフポリシー型のSoft Actor-Criticフレームワークを用いて、全体のビット予算の下でパープレキシティを最小化する層ごとのビット幅割り当てを学習する。政策ネットワークは、活性化統計量、重み特性、構造記述子からなる11次元の埋め込みを条件とし、モデルファミリーや規模を超えたゼロショット転移を可能にする。4ビット未満の安定した量子化を実現するため、チャネル単位のスケーリングと正規化層による補償を介して活性化の外れ値を重みに移行する前処理技術であるScale Foldingを導入する。非対称なペナルティと予算制約（バジェットクリフ）を備えた品質優先の報酬関数により、高速な収束を促進する。Llama 2 7Bにおいて、RAMPは3.68GB（実効3.65ビット）で5.54のパープレキシティを達成し、均一4ビットのAWQ（3.90GBで5.60）をサイズで6%、品質で1～3%上回った。重要なことに、Llama 2 7Bのみで学習した政策は、Llama 2 13BおよびMistral 7Bに対してもゼロショットで一般化し、特定ターゲット向けの学習を凌駕することも多く、量子化感度が主にアーキテクチャに依存するという仮説を支持する。HALOパイプラインは割り当てをGGUF形式で出力し、CPU、GPU、エッジデバイス上でカーネル不要の推論を実現し、FP16の常識推論性能の99.5%を維持する。

English

Post training quantization is essential for deploying large language models (LLMs) on resource constrained hardware, yet state of the art methods enforce uniform bit widths across layers, yielding suboptimal accuracy efficiency trade offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off policy Soft Actor Critic framework that learns per layer bit width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11 dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero shot transfer across model families and scales. To enable stable sub 4 bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per channel scaling and normalization layer compensation. A quality prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68GB (3.65 effective bits), outperforming uniform 4 bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero shot to Llama 2 13B and Mistral 7B, often surpassing target specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.

RAMP: 効率的なオンデバイスLLM推論のための強化学習適応混合精度量子化

RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On Device LLM Inference

要旨

Support