QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Abstract

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers a speedup of over 1.5x in the rollout phase. Moreover, this is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while matching the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
