RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference
March 18, 2026
Authors: Arpit Singh Gautam, Saurabh Jha
cs.AI
Abstract
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
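The abstract describes Scale Folding as migrating activation outliers into weights via per-channel scaling with normalization-layer compensation. A minimal sketch of that idea, assuming an RMSNorm-then-linear sublayer (as in Llama-style blocks): scale each weight column by a per-channel factor and divide the same factor out of the normalization gain, leaving the sublayer's output mathematically unchanged. The function names (`rms_norm`, `fold_scales`) and the choice of scales are illustrative, not the paper's actual API.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: normalize each token vector, then apply a per-channel gain."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

def fold_scales(gain, weight, scales):
    """Fold per-channel scales into the linear weight's input columns and
    divide them out of the preceding normalization gain, so the product
    (normalized activations @ weight.T) is preserved."""
    gain_folded = gain / scales
    weight_folded = weight * scales  # broadcasts over input channels
    return gain_folded, weight_folded

rng = np.random.default_rng(0)
d_in, d_out = 8, 4
x = rng.normal(size=(3, d_in))
gain = rng.normal(size=d_in)
W = rng.normal(size=(d_out, d_in))

# Per-channel scales; in practice these would be derived from
# activation outlier statistics collected on calibration data.
scales = 1.0 + np.abs(rng.normal(size=d_in))

y_ref = rms_norm(x, gain) @ W.T
g2, W2 = fold_scales(gain, W, scales)
y_folded = rms_norm(x, g2) @ W2.T

print(np.allclose(y_ref, y_folded))  # prints True
```

After folding, the outlier magnitudes live in the weight columns rather than the activations, which is what makes aggressive sub-4-bit weight quantization more stable.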
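The reward is described as quality-prioritized, with asymmetric penalties and a budget cliff. One plausible shape of such a reward, with all coefficients and the exact functional form being assumptions rather than the paper's formula: perplexity regressions are penalized much harder than improvements are rewarded, and any allocation over the global bit budget falls off a large constant "cliff".

```python
def reward(ppl, baseline_ppl, avg_bits, budget_bits,
           quality_weight=10.0, over_penalty=5.0, cliff=-100.0):
    """Score a per-layer bit-width allocation (hypothetical form).

    Quality dominates: perplexity degradation costs quality_weight
    times more than an equal improvement gains (asymmetric penalty).
    Exceeding the budget triggers a large constant penalty plus a
    slope (budget cliff), so over-budget allocations are dominated.
    """
    if avg_bits > budget_bits:
        return cliff - over_penalty * (avg_bits - budget_bits)
    delta = ppl - baseline_ppl
    return -quality_weight * delta if delta > 0 else -delta

# Under budget, matching baseline quality: neutral reward.
print(reward(5.54, 5.54, 3.65, 4.0))  # prints 0.0
# Over budget: falls off the cliff regardless of quality.
print(reward(5.54, 5.54, 4.5, 4.0))   # prints -102.5
```

The cliff makes the feasible region sharp, which is consistent with the abstract's claim that this shaping drives rapid convergence: the policy quickly learns never to exceed the budget and then optimizes quality within it.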