
QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

October 13, 2025
Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
cs.AI

Abstract

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers over a 1.5x speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, and matches the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
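The core idea, a trainable low-rank LoRA update on top of a frozen quantized base weight, can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the `fake_quantize` function below is a simple symmetric uniform quantizer standing in for NVFP4 (the real format is 4-bit floating point with block-wise scales), and all dimension and rank values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w, levels=15):
    # Symmetric uniform quantizer as a stand-in for NVFP4
    # (actual NVFP4 is 4-bit floating point with block scales).
    scale = np.abs(w).max() / (levels // 2)
    return np.round(w / scale) * scale

d_in, d_out, rank = 16, 8, 4
W = rng.standard_normal((d_in, d_out))        # pretrained base weight
Wq = fake_quantize(W)                          # quantized and frozen
A = rng.standard_normal((d_in, rank)) * 0.01   # trainable LoRA factor
B = np.zeros((rank, d_out))                    # zero-init so LoRA starts as identity

def forward(x):
    # Base path uses the quantized weight; only A and B would receive
    # gradients during RL fine-tuning.
    return x @ Wq + x @ A @ B

x = rng.standard_normal((2, d_in))
y = forward(x)
print(y.shape)  # (2, 8)
```

The quantization error `W - Wq` acts as a fixed perturbation of the policy's logits, which is the source of the entropy increase the abstract attributes to quantization noise; AQN would additionally inject a scheduled noise term, which this sketch omits.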