QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs
October 13, 2025
Authors: Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, Yukang Chen
cs.AI
Abstract
We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for
large language models (LLMs). While RL is essential for LLMs' reasoning
capabilities, it is resource-intensive, requiring substantial GPU memory and
long rollout durations. QeRL addresses these issues by combining NVFP4
quantization with Low-Rank Adaptation (LoRA), accelerating the rollout phase
of RL while reducing memory overhead. Beyond efficiency, our findings show
that quantization noise increases policy entropy, enhancing exploration and
enabling the discovery of better strategies during RL. To further optimize
exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism,
which dynamically adjusts noise during training. Experiments demonstrate that
QeRL delivers an over 1.5x speedup in the rollout phase. Moreover, it is the
first framework to enable RL training of a 32B LLM on a single H100 80GB GPU,
while delivering overall speedups for RL training. It also achieves faster
reward growth and higher final accuracy than 16-bit LoRA and QLoRA, while
matching the performance of full-parameter fine-tuning on mathematical
benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%) with the 7B model.
These results establish QeRL as an efficient and effective framework for RL
training of LLMs.
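The abstract's central observation is that quantization noise raises policy entropy, which encourages exploration during RL. The toy sketch below illustrates that effect in isolation; it is not the paper's method, and the noise model (i.i.d. Gaussian perturbations on the logits, standing in for the effective noise injected by weight quantization) is an assumption for illustration only. For a peaked softmax policy, such perturbations raise the average entropy:

```python
import math
import random

def softmax_entropy(logits):
    """Shannon entropy (in nats) of the softmax distribution over logits."""
    m = max(logits)                               # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return -sum((e / z) * math.log(e / z) for e in exps)

random.seed(0)
clean_logits = [4.0, 1.0, 0.5, 0.2, 0.1]          # a confident (peaked) policy

# Perturb the logits with Gaussian noise and average the resulting entropy.
trials = 2000
noisy_entropy = sum(
    softmax_entropy([x + random.gauss(0.0, 1.0) for x in clean_logits])
    for _ in range(trials)
) / trials

# On average, the perturbed policy is flatter (higher entropy) than the clean one.
print(softmax_entropy(clean_logits) < noisy_entropy)
```

Flatter action distributions sample low-probability reasoning paths more often, which is the exploration benefit the abstract attributes to quantization noise.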
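QeRL freezes an NVFP4-quantized base model and trains only LoRA adapters, with AQN adjusting the injected noise over training. The abstract does not specify the NVFP4 details or the AQN schedule, so the sketch below is a minimal stand-in under stated assumptions: a scalar "layer" with grid rounding in place of NVFP4, and a hypothetical exponential decay for the noise schedule (`fake_quantize`, `aqn_sigma`, and `qerl_forward` are all illustrative names):

```python
import math
import random

def fake_quantize(w, step=0.25):
    """Round a weight to a coarse grid -- a crude stand-in for NVFP4,
    which is really a 4-bit floating-point format with block scales."""
    return round(w / step) * step

def aqn_sigma(t, sigma0=0.05, decay=0.01):
    """Hypothetical AQN-style schedule: exponentially anneal the extra
    exploration noise as RL training progresses (step t)."""
    return sigma0 * math.exp(-decay * t)

def qerl_forward(x, w_base, lora_a, lora_b, t, sigma0=0.05, rng=random):
    """Frozen quantized base weight + annealed noise + trainable LoRA delta.
    Only lora_a and lora_b would receive gradients during training."""
    w_q = fake_quantize(w_base) + rng.gauss(0.0, aqn_sigma(t, sigma0))
    return w_q * x + (lora_b * lora_a) * x

# With sigma0=0 the layer is deterministic:
# fake_quantize(1.13) == 1.25, so the output is 1.25*2 + (0.2*0.1)*2 = 2.54.
print(qerl_forward(2.0, 1.13, 0.1, 0.2, t=0, sigma0=0.0))
```

This division of labor is what yields the memory savings the abstract claims: the bulk of the parameters sit frozen in a low-bit format (also speeding up rollouts), while only the small low-rank adapters are kept in high precision and updated.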