RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference
March 18, 2026
Authors: Arpit Singh Gautam, Saurabh Jha
cs.AI
Abstract
Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
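The Scale Folding idea described in the abstract can be sketched numerically. The snippet below is an illustrative toy (the function name `fold_scales`, the scale exponent `alpha`, and the shapes are assumptions, not the paper's implementation): a per-input-channel scale derived from activation magnitudes is multiplied into the weight columns, and its reciprocal is folded into the preceding normalization layer's weights, so the layer output is mathematically unchanged while activation outliers are absorbed into the weights before quantization.

```python
import numpy as np

def fold_scales(ln_weight, W, act_absmax, alpha=0.5):
    """Toy sketch of Scale Folding (hypothetical API, not the paper's code).

    Migrates activation outliers into the weight matrix W (shape [out, in])
    via a per-input-channel scale s, compensating in the preceding
    normalization layer's weight so that (x / s) @ (W * s).T == x @ W.T.
    """
    s = np.maximum(act_absmax, 1e-5) ** alpha  # per-channel scale from activation stats
    W_scaled = W * s[None, :]                  # outlier mass moved into weight columns
    ln_folded = ln_weight / s                  # compensation folded into the norm layer
    return ln_folded, W_scaled

# Verify mathematical equivalence on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                    # activations entering the norm layer
W = rng.normal(size=(16, 8))                   # linear layer weights
ln_w = rng.normal(size=8)                      # norm layer's elementwise weight
ln_f, W_s = fold_scales(ln_w, W, np.abs(x).max(axis=0))

y_ref = (x * ln_w) @ W.T                       # original computation
y_new = (x * ln_f) @ W_s.T                     # after folding: identical output
assert np.allclose(y_ref, y_new)
```

Because the rescaling is absorbed offline into existing parameters, no extra runtime kernel is needed, which is consistent with the abstract's claim of kernel-free GGUF inference.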