Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
March 24, 2025
Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
cs.AI
Abstract
Reinforcement learning (RL) is a critical component of large language model
(LLM) post-training. However, existing on-policy algorithms used for
post-training are inherently incompatible with the use of experience replay
buffers, which can be populated scalably by distributed off-policy actors to
enhance exploration as compute increases. We propose efficiently obtaining this
benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a
massively scalable LLM RL system. In contrast to existing approaches, TBA uses
a larger fraction of compute on search, constantly generating off-policy data
for a central replay buffer. A training node simultaneously samples data from
this buffer based on reward or recency to update the policy using Trajectory
Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA
offers three key advantages: (1) decoupled training and search, speeding up
training wall-clock time by 4x or more; (2) improved diversity through
large-scale off-policy sampling; and (3) scalable search for sparse reward
settings. On mathematical reasoning, preference-tuning, and automated
red-teaming (diverse and representative post-training tasks), TBA produces
speed and performance improvements over strong baselines.Summary
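
The asynchronous loop itself can also be caricatured. The abstract only states that searcher nodes continuously push off-policy generations into a central replay buffer while a training node samples from it by reward or by recency; the sketch below fills in unspecified details (top-k reward prioritization, a recent-slice window, alternating between the two modes, Python threading) purely for illustration, and names like CentralReplayBuffer, searcher_loop, and the generate / reward_fn / update_policy callables are hypothetical rather than the paper's API.

```python
import random
import threading
import time

class CentralReplayBuffer:
    """Shared buffer filled by searcher nodes and read by the trainer."""

    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self.data = []                     # dicts: prompt, response, reward
        self.lock = threading.Lock()

    def add(self, item):
        with self.lock:
            self.data.append(item)
            if len(self.data) > self.capacity:
                self.data.pop(0)           # evict the oldest entry when full

    def sample(self, batch_size, mode="reward"):
        with self.lock:
            if not self.data:
                return []
            if mode == "reward":           # bias toward high-reward trajectories
                pool = sorted(self.data, key=lambda d: d["reward"], reverse=True)
                pool = pool[:batch_size * 4]
            else:                          # "recency": prefer fresh generations
                pool = self.data[-batch_size * 4:]
            return random.sample(pool, min(batch_size, len(pool)))


def searcher_loop(buffer, generate, reward_fn, prompts, stop):
    """Off-policy actor: keeps generating and scoring responses until stopped."""
    while not stop.is_set():               # stop is a threading.Event
        prompt = random.choice(prompts)
        response = generate(prompt)        # may use stale policy weights
        buffer.add({"prompt": prompt, "response": response,
                    "reward": reward_fn(prompt, response)})


def trainer_loop(buffer, update_policy, steps, batch_size=8):
    """Training node: samples by reward or recency (alternating here for illustration)."""
    for i in range(steps):
        mode = "reward" if i % 2 == 0 else "recency"
        batch = buffer.sample(batch_size, mode=mode)
        if batch:
            update_policy(batch)           # e.g. a gradient step on tb_loss
        else:
            time.sleep(0.01)               # wait for searchers to fill the buffer
```

A real deployment would replace the threads with distributed inference workers and a learner process, but the division of labor is the same: generation never blocks on gradient steps, and the trainer never waits on sampling.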