QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
October 13, 2025
Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
cs.AI
Abstract
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While RL is essential for LLMs' reasoning capabilities, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, enhancing exploration and enabling the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts noise during training. Experiments demonstrate that QeRL delivers more than a 1.5× speedup in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while delivering overall speedups for RL training. It also achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, and matches the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) for a 7B model. These results establish QeRL as an efficient and effective framework for RL training in LLMs.
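
To make the entropy claim concrete, here is a minimal sketch, not the authors' implementation: it measures the mean softmax entropy of a toy policy head before and after weight quantization. The `fake_quantize` function is a uniform-integer stand-in for NVFP4 (whose exact format is not reproduced here), and all sizes and names are illustrative; the direction and magnitude of the entropy shift depend on the model and quantizer, whereas the paper reports an increase for real LLM policies.

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_quantize(w: np.ndarray, n_bits: int = 4) -> np.ndarray:
    """Uniform symmetric fake-quantization: a crude stand-in for NVFP4."""
    scale = np.abs(w).max() / (2 ** (n_bits - 1) - 1)
    return np.round(w / scale) * scale

def mean_entropy(logits: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of the softmax distributions in a batch."""
    z = logits - logits.max(axis=-1, keepdims=True)   # stabilize softmax
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

# Toy "policy head": a batch of hidden states projected to vocabulary logits.
hidden = rng.standard_normal((256, 64))       # illustrative batch of states
w = rng.standard_normal((64, 1000)) * 0.2     # illustrative projection weights

print("fp entropy:       ", mean_entropy(hidden @ w))
print("quantized entropy:", mean_entropy(hidden @ fake_quantize(w)))
```

Higher policy entropy means the sampled rollouts cover more of the action space, which is why the abstract frames quantization noise as an exploration aid rather than purely as an approximation error.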
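
The abstract states only that AQN adjusts noise dynamically during training, so the following is a hypothetical schedule sketched for illustration: the function `aqn_sigma` and its `sigma_start`/`sigma_end` endpoints and exponential decay shape are assumptions, not the paper's mechanism.

```python
import math

def aqn_sigma(step: int, total_steps: int,
              sigma_start: float = 1e-2, sigma_end: float = 1e-4) -> float:
    """Exponentially decay a weight-noise scale from sigma_start to sigma_end.

    Hypothetical AQN-style schedule: large noise early in training favors
    exploration; small noise late favors exploitation of learned strategies.
    """
    t = min(max(step / max(total_steps, 1), 0.0), 1.0)  # progress in [0, 1]
    return sigma_start * math.exp(t * math.log(sigma_end / sigma_start))

# Example: noise scale at a few points in a 1,000-step run.
for step in (0, 250, 500, 1000):
    print(step, f"{aqn_sigma(step, 1000):.2e}")
```

A decaying schedule of this shape mirrors standard exploration annealing in RL; whatever its exact form, the point of AQN per the abstract is that the noise level is a controlled training signal rather than a fixed artifact of the quantizer.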