Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training
March 24, 2025
Authors: Brian R. Bartoldson, Siddarth Venkatraman, James Diffenderfer, Moksh Jain, Tal Ben-Nun, Seanie Lee, Minsu Kim, Johan Obando-Ceron, Yoshua Bengio, Bhavya Kailkhura
cs.AI
Abstract
Reinforcement learning (RL) is a critical component of large language model
(LLM) post-training. However, existing on-policy algorithms used for
post-training are inherently incompatible with the use of experience replay
buffers, which can be populated scalably by distributed off-policy actors to
enhance exploration as compute increases. We propose efficiently obtaining this
benefit of replay buffers via Trajectory Balance with Asynchrony (TBA), a
massively scalable LLM RL system. In contrast to existing approaches, TBA uses
a larger fraction of compute on search, constantly generating off-policy data
for a central replay buffer. A training node simultaneously samples data from
this buffer based on reward or recency to update the policy using Trajectory
Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA
offers three key advantages: (1) decoupled training and search, speeding up
training wall-clock time by 4x or more; (2) improved diversity through
large-scale off-policy sampling; and (3) scalable search for sparse reward
settings. On mathematical reasoning, preference-tuning, and automated
red-teaming (diverse and representative post-training tasks), TBA produces
speed and performance improvements over strong baselines.
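
The data flow described in the abstract (searchers filling a central buffer, a trainer sampling from it by reward or recency, and a Trajectory Balance update) can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper's code: `Trajectory`, `ReplayBuffer`, `tb_loss`, the eviction rule, and the use of a single scalar `log_z` are assumptions made for illustration. The loss follows the general trajectory-balance form for autoregressive generation, where the backward-policy term vanishes because every token sequence has exactly one parent prefix; the actual system may estimate the partition term differently.

```python
import math
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    response: str
    log_reward: float  # log R(response | prompt); assumed to come from a reward model or verifier
    step_added: int    # searcher-side counter, used for recency sampling


class ReplayBuffer:
    """Hypothetical central buffer filled by off-policy searcher nodes."""

    def __init__(self, capacity, seed=0):
        self.items = deque(maxlen=capacity)  # oldest entries are evicted first (an assumption)
        self.rng = random.Random(seed)

    def add(self, traj):
        self.items.append(traj)

    def sample(self, k, mode="reward"):
        items = list(self.items)
        if not items:
            return []
        if mode == "recency":
            # Take the k most recently added trajectories.
            return sorted(items, key=lambda t: t.step_added, reverse=True)[:k]
        # Reward-prioritised: draw with replacement, probability proportional to
        # R = exp(log_reward). Subtracting the max keeps exp() numerically stable.
        m = max(t.log_reward for t in items)
        weights = [math.exp(t.log_reward - m) for t in items]
        return self.rng.choices(items, weights=weights, k=k)


def tb_loss(log_z, log_pf, log_reward):
    """Squared trajectory-balance residual for one sequence.

    log_pf is the sum of token log-probabilities under the current policy.
    No backward-policy term appears: a token sequence has a unique parent prefix.
    """
    return (log_z + log_pf - log_reward) ** 2


# Usage: five trajectories pushed into a buffer of capacity four.
buf = ReplayBuffer(capacity=4)
samples = [("a", -3.0), ("b", -1.0), ("c", -0.5), ("d", -2.0), ("e", 0.0)]
for step, (resp, log_r) in enumerate(samples):
    buf.add(Trajectory("2+2=?", resp, log_r, step))

recent = buf.sample(2, mode="recency")
by_reward = buf.sample(3, mode="reward")
```

In a real deployment the searchers and the trainer would run as separate processes and the policy log-probabilities would come from the model; the sketch only shows why stale, off-policy samples are usable: the loss depends on the trajectory's reward and the current policy's likelihood of it, not on which policy generated it.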