RAMP: Reinforcement Adaptive Mixed Precision Quantization for Efficient On-Device LLM Inference

Abstract

Post-training quantization is essential for deploying large language models (LLMs) on resource-constrained hardware, yet state-of-the-art methods enforce uniform bit-widths across layers, yielding suboptimal accuracy-efficiency trade-offs. We present RAMP (Reinforcement Adaptive Mixed Precision), an off-policy Soft Actor-Critic framework that learns per-layer bit-width assignments to minimize perplexity under a global bit budget. The policy conditions on an 11-dimensional embedding of activation statistics, weight properties, and structural descriptors, enabling zero-shot transfer across model families and scales. To enable stable sub-4-bit quantization, we introduce Scale Folding, a preconditioning technique that migrates activation outliers into weights via per-channel scaling and normalization-layer compensation. A quality-prioritized reward with asymmetric penalties and budget cliffs drives rapid convergence. On Llama 2 7B, RAMP achieves 5.54 perplexity at 3.68 GB (3.65 effective bits), outperforming uniform 4-bit AWQ (5.60 at 3.90 GB) and GPTQ by 6% in size and 1% to 3% in quality. Critically, a policy trained only on Llama 2 7B generalizes zero-shot to Llama 2 13B and Mistral 7B, often surpassing target-specific training, supporting the hypothesis that quantization sensitivity is primarily architectural. The HALO pipeline exports allocations to GGUF format for kernel-free inference on CPUs, GPUs, and edge devices, retaining 99.5% of FP16 commonsense reasoning performance.
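The outlier-migration idea behind Scale Folding can be illustrated with a small sketch. The abstract does not state how the per-channel scales are chosen, so the rule below (a SmoothQuant-style balance between activation and weight magnitudes, controlled by a hypothetical `alpha` parameter) is an assumption, as are all function names and the toy data. What the sketch does show is the general mechanism: dividing the normalization gain by a per-channel scale and multiplying the matching weight columns by the same scale leaves the layer output unchanged while shrinking the activation range.

```python
import math


def rms_norm(x, gain):
    """RMSNorm without epsilon or bias (x is assumed to be non-zero)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [g * v / rms for g, v in zip(gain, x)]


def linear(h, weight):
    """Bias-free linear layer; weight is a list of rows, shape [out][in]."""
    return [sum(w * v for w, v in zip(row, h)) for row in weight]


def fold_scales(gain, weight, act_absmax, alpha=0.5):
    """Fold per-channel scales into the norm gain and the next linear weight.

    ASSUMPTION: the scale rule s_j = act_j^alpha / w_j^(1 - alpha) is
    borrowed from SmoothQuant for illustration; it is not taken from the paper.
    """
    scales = []
    for j in range(len(gain)):
        w_absmax = max(abs(row[j]) for row in weight)
        # Clamp both statistics so the scale stays finite and positive.
        s = max(act_absmax[j], 1e-8) ** alpha / max(w_absmax, 1e-8) ** (1.0 - alpha)
        scales.append(s)
    # Compensation in the normalization layer: activations shrink by s_j ...
    new_gain = [g / s for g, s in zip(gain, scales)]
    # ... and the matching input column of the weight grows by s_j.
    new_weight = [[w * s for w, s in zip(row, scales)] for row in weight]
    return new_gain, new_weight, scales


# Toy data (hypothetical): channel 1 carries an activation outlier.
x = [0.5, -40.0, 1.0, 0.25]
gain = [1.0, 1.0, 1.0, 1.0]
weight = [[0.2, -0.1, 0.3, 0.05],
          [-0.4, 0.1, 0.1, 0.2]]

h_ref = rms_norm(x, gain)
act_absmax = [abs(v) for v in h_ref]  # single-sample "calibration"
new_gain, new_weight, scales = fold_scales(gain, weight, act_absmax)

h_fold = rms_norm(x, new_gain)
y_ref = linear(h_ref, weight)
y_fold = linear(h_fold, new_weight)
```

Here `y_fold` matches `y_ref` up to floating-point rounding, while the largest entry of `h_fold` is smaller than the largest entry of `h_ref`, which is the property that makes the folded activations easier to quantize at low bit-widths.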
