TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Abstract

Recent advances in aligning large language models via reinforcement learning have achieved remarkable gains on complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which involves a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks and show that the sampling design saves 22% to 43% of GPU hours for the trained models, along with reductions of up to 40% in trajectory-level and 35% in token-level sampling compute for existing models. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
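To make the idea of amortizing computation across common prefixes concrete, the following is a minimal sketch, not the TreePO implementation. It assumes a toy rollout tree in which every node is one fixed-length segment and every node at depth i spawns the same number of children; the function name `decoded_tokens` and the parameters `branching` and `segment_len` are hypothetical and chosen only for illustration. The sketch compares how many tokens are decoded when shared prefixes are decoded once against decoding every trajectory independently.

```python
from math import prod


def decoded_tokens(branching: list[int], segment_len: int) -> tuple[int, int]:
    """Return (tree_tokens, independent_tokens) for a toy rollout tree.

    branching[i] is the number of children each node spawns at depth i,
    and segment_len is the fixed segment length in tokens. Both are
    illustrative assumptions, not values from the paper.
    """
    if not branching:
        raise ValueError("branching must contain at least one level")

    depth = len(branching)
    leaves = prod(branching)  # number of complete trajectories

    # With prefix sharing, each tree node (one segment) is decoded once.
    # The number of nodes at depth i is the product of branching[0..i].
    tree_segments = sum(prod(branching[: i + 1]) for i in range(depth))

    # Without sharing, every trajectory decodes all of its segments itself.
    independent_segments = leaves * depth

    return tree_segments * segment_len, independent_segments * segment_len


# Example: 1 root segment, each splitting into 2, each of those into 4.
tree_tokens, independent_tokens = decoded_tokens([1, 2, 4], segment_len=10)
print(tree_tokens, independent_tokens)  # prints "110 240"
```

The numbers in this toy are purely illustrative and are not the savings reported in the abstract. Real savings depend on where branches diverge, how early low-value paths are pruned, and how the KV cache is managed.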

