Trajectory Balance with Asynchrony: Decoupling Exploration and Learning for Fast, Scalable LLM Post-Training

Abstract

Reinforcement learning (RL) is a critical component of large language model (LLM) post-training. However, the on-policy algorithms currently used for post-training are inherently incompatible with experience replay buffers, which distributed off-policy actors can populate scalably to enhance exploration as compute increases. We propose to obtain this benefit of replay buffers efficiently via Trajectory Balance with Asynchrony (TBA), a massively scalable LLM RL system. In contrast to existing approaches, TBA spends a larger fraction of compute on search, constantly generating off-policy data for a central replay buffer. A training node simultaneously samples data from this buffer, based on reward or recency, and updates the policy using Trajectory Balance (TB), a diversity-seeking RL objective introduced for GFlowNets. TBA offers three key advantages: (1) decoupled training and search, which speeds up training wall-clock time by 4x or more; (2) improved diversity through large-scale off-policy sampling; and (3) scalable search for sparse reward settings. On mathematical reasoning, preference-tuning, and automated red-teaming (diverse and representative post-training tasks), TBA produces speed and performance improvements over strong baselines.
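
The data flow described in the abstract can be illustrated with a minimal, single-process sketch. It is not the authors' implementation: the names `Trajectory`, `ReplayBuffer`, `half_life`, and `tb_loss` are hypothetical, the weighting schemes for reward-based and recency-based sampling are assumptions, and the loss is the standard Trajectory Balance form from the GFlowNet literature, written under the assumption that autoregressive generation makes the backward policy trivial so that only log Z, the sequence log-probability, and the log-reward remain.

```python
import math
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Trajectory:
    prompt: str
    response: str
    log_reward: float  # log R(x, y); assumed to come from a reward model or verifier
    step: int          # actor-side counter, used here as a recency signal


class ReplayBuffer:
    """Central buffer that off-policy actors append to and a trainer samples from."""

    def __init__(self, capacity: int):
        # A bounded deque drops the oldest trajectory once capacity is reached.
        self.items = deque(maxlen=capacity)

    def add(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def sample(self, k: int, mode: str = "reward", half_life: float = 100.0, rng=random):
        items = list(self.items)
        if mode == "reward":
            # Assumed scheme: weights proportional to the reward itself,
            # shifted by the maximum log-reward for numerical stability.
            top = max(t.log_reward for t in items)
            weights = [math.exp(t.log_reward - top) for t in items]
        elif mode == "recency":
            # Assumed scheme: weight halves every `half_life` steps of age.
            newest = max(t.step for t in items)
            weights = [0.5 ** ((newest - t.step) / half_life) for t in items]
        else:
            raise ValueError(f"unknown sampling mode: {mode}")
        return rng.choices(items, weights=weights, k=k)


def tb_loss(log_z: float, log_prob: float, log_reward: float) -> float:
    """Squared Trajectory Balance residual for one sampled response.

    log_z is a learned or estimated log-partition value for the prompt,
    log_prob is the policy's log-probability of the full response.
    """
    return (log_z + log_prob - log_reward) ** 2
```

In a distributed setting, actors would call `add` continuously while the training node calls `sample` and minimizes `tb_loss` over the batch; how stale policy weights are refreshed on the actors is left out of this sketch.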
