QeRL: Beyond Efficiency -- Quantization-enhanced Reinforcement Learning for LLMs

Abstract

We propose QeRL, a Quantization-enhanced Reinforcement Learning framework for large language models (LLMs). While reinforcement learning (RL) is essential for improving the reasoning capabilities of LLMs, it is resource-intensive, requiring substantial GPU memory and long rollout durations. QeRL addresses these issues by combining NVFP4 quantization with Low-Rank Adaptation (LoRA), which accelerates the rollout phase of RL while reducing memory overhead. Beyond efficiency, our findings show that quantization noise increases policy entropy, which enhances exploration and enables the discovery of better strategies during RL. To further optimize exploration, QeRL introduces an Adaptive Quantization Noise (AQN) mechanism, which dynamically adjusts the noise during training. Experiments demonstrate that QeRL delivers a speedup of more than 1.5x in the rollout phase. Moreover, it is the first framework to enable RL training of a 32B LLM on a single H100 80GB GPU, while also delivering an overall speedup for RL training. QeRL achieves faster reward growth and higher final accuracy than 16-bit LoRA and QLoRA, and on the 7B model it matches the performance of full-parameter fine-tuning on mathematical benchmarks such as GSM8K (90.8%) and MATH 500 (77.4%). These results establish QeRL as an efficient and effective framework for RL training of LLMs.
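As a rough illustration of why 4-bit weights matter for fitting a 32B model on one 80 GB GPU, the sketch below estimates weight storage only. It is a back-of-the-envelope calculation, not the authors' accounting: the parameter count is taken as a nominal 32e9, and the NVFP4 cost is assumed to be 4 bits per value plus one 8-bit scale shared by each block of 16 weights. Activations, the KV cache, LoRA adapters, and optimizer state are all ignored.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Return weight storage in GB, using 1 GB = 1e9 bytes."""
    return num_params * bits_per_weight / 8 / 1e9


# Assumption: nominal parameter count; real "32B" models differ slightly.
NUM_PARAMS = 32e9

# 16-bit baseline (BF16 or FP16 weights).
BITS_16 = 16.0

# Assumption: 4-bit values plus one 8-bit scale per 16-weight block,
# giving 4 + 8/16 = 4.5 effective bits per weight.
BITS_NVFP4 = 4.0 + 8.0 / 16.0

mem_16bit_gb = weight_memory_gb(NUM_PARAMS, BITS_16)
mem_nvfp4_gb = weight_memory_gb(NUM_PARAMS, BITS_NVFP4)

print(f"16-bit weights: {mem_16bit_gb:.1f} GB")
print(f"NVFP4 weights:  {mem_nvfp4_gb:.1f} GB")
print(f"Reduction:      {mem_16bit_gb / mem_nvfp4_gb:.2f}x")
```

Under these assumptions the 16-bit weights alone take about 64 GB of an 80 GB device, leaving little room for rollout and training state, while the 4-bit weights take about 18 GB.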
