Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
March 24, 2025
Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
cs.AI
Abstract
Reinforcement learning (RL) is a critical component of large language model
(LLM) post-training. However, existing on-policy algorithms used for
post-training are inherently incompatible with the use of experience replay
buffers, which can be populated scalably by distributed off-policy actors to
enhance exploration as compute increases. We propose efficiently obtaining this
benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a
massively scalable LLM RL system. In contrast to existing approaches, TBA uses
a larger fraction of compute on search, constantly generating off-policy data
for a central replay buffer. A training node simultaneously samples data from
this buffer based on reward or recency to update the policy using Trajectory
Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA
offers three key advantages: (1) decoupled training and search, speeding up
training wall-clock time by 4x or more; (2) improved diversity through
large-scale off-policy sampling; and (3) scalable search for sparse reward
settings. On mathematical reasoning, preference-tuning, and automated
red-teaming (diverse and representative post-training tasks), TBA produces
speed and performance improvements over strong baselines.Summary
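
The asynchronous loop itself can also be caricatured. The abstract only states that searcher nodes continuously push off-policy generations into a central replay buffer while a training node samples from it by reward or by recency; the sketch below fills in unspecified details (top-k reward prioritization, a recent-slice window, alternating between the two modes, Python threading) purely for illustration, and names like CentralReplayBuffer, searcher_loop, and the generate / reward_fn / update_policy callables are hypothetical rather than the paper's API.

```python
import random
import threading
import time

class CentralReplayBuffer:
    """Shared buffer filled by searcher nodes and read by the trainer."""

    def __init__(self, capacity=50_000):
        self.capacity = capacity
        self.data = []                     # dicts: prompt, response, reward
        self.lock = threading.Lock()

    def add(self, item):
        with self.lock:
            self.data.append(item)
            if len(self.data) > self.capacity:
                self.data.pop(0)           # evict the oldest entry when full

    def sample(self, batch_size, mode="reward"):
        with self.lock:
            if not self.data:
                return []
            if mode == "reward":           # bias toward high-reward trajectories
                pool = sorted(self.data, key=lambda d: d["reward"], reverse=True)
                pool = pool[:batch_size * 4]
            else:                          # "recency": prefer fresh generations
                pool = self.data[-batch_size * 4:]
            return random.sample(pool, min(batch_size, len(pool)))


def searcher_loop(buffer, generate, reward_fn, prompts, stop):
    """Off-policy actor: keeps generating and scoring responses until stopped."""
    while not stop.is_set():               # stop is a threading.Event
        prompt = random.choice(prompts)
        response = generate(prompt)        # may use stale policy weights
        buffer.add({"prompt": prompt, "response": response,
                    "reward": reward_fn(prompt, response)})


def trainer_loop(buffer, update_policy, steps, batch_size=8):
    """Training node: samples by reward or recency (alternating here for illustration)."""
    for i in range(steps):
        mode = "reward" if i % 2 == 0 else "recency"
        batch = buffer.sample(batch_size, mode=mode)
        if batch:
            update_policy(batch)           # e.g. a gradient step on tb_loss
        else:
            time.sleep(0.01)               # wait for searchers to fill the buffer
```

A real deployment would replace the threads with distributed inference workers and a learner process, but the division of labor is the same: generation never blocks on gradient steps, and the trainer never waits on sampling.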