Prosperity before Collapse: How Far Can Off-Policy RL Reach with Stale Data on LLMs?

Abstract

Reinforcement learning has been central to recent advances in large language model reasoning, but most algorithms rely on on-policy training that demands fresh rollouts at every update, limiting efficiency and scalability. Asynchronous RL systems alleviate this by decoupling rollout generation from training, yet their effectiveness hinges on tolerating large staleness in rollout data, a setting where existing methods either degrade in performance or collapse. We revisit this challenge and uncover a prosperity-before-collapse phenomenon: stale data can be as informative as on-policy data if exploited properly. Building on this insight, we introduce M2PO (Second-Moment Trust Policy Optimization), which constrains the second moment of importance weights to suppress only extreme outliers while preserving informative updates. Notably, M2PO sharply reduces the fraction of clipped tokens under high staleness (from 1.22% to 0.06% over training), precisely masking high-variance tokens while maintaining stable optimization. Extensive evaluation across six models (from 1.7B to 32B) and eight benchmarks shows that M2PO delivers stable off-policy training even with data stale by at least 256 model updates and matches on-policy performance.
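The abstract states only that M2PO constrains the second moment of the importance weights so that extreme outliers are suppressed and informative tokens are kept. One plausible reading can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the constraint is applied to per-token log importance ratios, that tokens are masked greedily from the largest squared log-ratio downward, and that masking stops once the mean squared log-ratio of the remaining tokens falls below a threshold. The function name `second_moment_mask` and the threshold value `0.04` are hypothetical choices made for the example.

```python
def second_moment_mask(log_ratios, threshold):
    """Return a keep-mask over tokens (True = keep, False = mask).

    Assumption (not from the abstract): the second moment is the mean of
    squared log importance ratios over the kept tokens, and tokens are
    dropped greedily, largest squared log-ratio first, until that mean
    is at or below `threshold`.
    """
    n = len(log_ratios)
    squares = [x * x for x in log_ratios]
    # Visit tokens from the most extreme ratio to the least extreme one.
    order = sorted(range(n), key=lambda i: squares[i], reverse=True)

    keep = [True] * n
    total = sum(squares)  # running sum of squares over kept tokens
    kept = n              # number of tokens still kept

    for i in order:
        # Stop as soon as the kept tokens satisfy the constraint.
        if kept == 0 or total / kept <= threshold:
            break
        keep[i] = False
        total -= squares[i]
        kept -= 1
    return keep


# Hypothetical log-ratios log(pi_new / pi_old) for five tokens of stale data.
log_ratios = [0.01, -0.02, 0.03, 2.0, -0.01]
mask = second_moment_mask(log_ratios, threshold=0.04)
print(mask)  # [True, True, True, False, True]
```

In this toy batch a single outlier dominates the second moment, so masking it alone satisfies the constraint and the other four tokens still contribute to the update. A per-token clipping rule, by contrast, judges each ratio on its own and may discard more tokens when the data is stale.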
