Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

October 23, 2024
Authors: Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville
cs.AI

Abstract

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
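Below is a minimal sketch of the asynchronous split between generation and learning described in the abstract: one loop keeps generating and reward-labelling samples from a recently published copy of the policy while another loop trains on samples from earlier policy iterations. All names here (`generate_samples`, `reward_model`, `dpo_update`) are hypothetical placeholders standing in for an LLM sampler, a reward model, and an online DPO update step; the threading setup is an assumption used to illustrate the control flow, not the authors' implementation.

```python
# Sketch only: placeholder functions, not the paper's actual training code.
import copy
import queue
import threading

def generate_samples(weights, batch_size):
    # Placeholder: sample responses from the policy defined by `weights`.
    return [{"prompt": f"p{i}", "policy_version": weights["version"]}
            for i in range(batch_size)]

def reward_model(sample):
    # Placeholder: score a (prompt, response) pair.
    return 0.0

def dpo_update(weights, batch):
    # Placeholder: one online-DPO-style update on a batch that may come from
    # an older policy iteration (the "online but off-policy" regime).
    weights["version"] += 1
    return weights

sample_queue = queue.Queue(maxsize=4)   # bounded, so batches stay only a few versions stale
policy_weights = {"version": 0}
weights_lock = threading.Lock()
done = threading.Event()

def generator():
    # Generation loop: always uses the most recently published weights, which may
    # lag the learner by one or more updates -- the tolerated off-policyness.
    while not done.is_set():
        with weights_lock:
            snapshot = copy.deepcopy(policy_weights)
        batch = generate_samples(snapshot, batch_size=8)
        for sample in batch:
            sample["reward"] = reward_model(sample)
        sample_queue.put(batch)

def learner(num_updates=100):
    # Learning loop: trains on whatever labelled samples are already available
    # instead of waiting for generation, which is where the wall-clock speedup comes from.
    global policy_weights
    for _ in range(num_updates):
        batch = sample_queue.get()
        with weights_lock:
            policy_weights = dpo_update(policy_weights, batch)
    done.set()

threading.Thread(target=generator, daemon=True).start()
learner()
```

In the synchronous paradigm these two loops would alternate in lockstep; here they overlap, and the queue depth bounds how off-policy the training batches can become.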
