TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling
August 24, 2025
作者: Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang
cs.AI
Abstract
Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured search process. Combining a dynamic tree sampling policy with fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO fundamentally reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV-cache burden through contiguous segments and spawns new branches, together with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate the performance gains of TreePO on a set of reasoning benchmarks, showing that the sampling design saves 22% to 43% of GPU hours for trained models, while reducing sampling compute for existing models by up to 40% at the trajectory level and 35% at the token level. While offering a free lunch in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
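To make the abstract's description more concrete, below is a minimal sketch of a TreePO-style segment-wise tree rollout and a blended segment-level advantage. It is only an illustration of the idea under stated assumptions, not the paper's implementation: the helper callables (`decode_segment`, `segment_uncertainty`, `is_terminal`), the branching threshold, the pruning rule, and the `alpha` mixing weight are all hypothetical placeholders.

```python
# Illustrative sketch of segment-wise tree sampling with uncertainty-driven
# branching, early stopping, and pruning. All helpers and thresholds are
# hypothetical stand-ins; the paper's actual policies may differ.
from dataclasses import dataclass, field


@dataclass
class Node:
    tokens: list             # tokens of this fixed-length segment
    logprob: float           # mean token log-prob of the segment (local confidence)
    children: list = field(default_factory=list)
    done: bool = False       # reached EOS or another stop condition


def tree_rollout(prompt, decode_segment, segment_uncertainty, is_terminal,
                 seg_len=64, max_depth=16, branch_threshold=1.5, max_width=8):
    """Grow a rollout tree: shared prefixes are decoded once, extra branches are
    spawned only where local uncertainty is high, and finished paths stop early."""
    root = Node(tokens=list(prompt), logprob=0.0)
    frontier = [root]
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            if node.done:
                continue
            # Spawn an extra branch only when local uncertainty (e.g., entropy) is high.
            width = 2 if segment_uncertainty(node) > branch_threshold else 1
            for _ in range(width):
                tokens, logprob = decode_segment(node, seg_len)  # one fixed-length segment
                child = Node(tokens=tokens, logprob=logprob, done=is_terminal(tokens))
                node.children.append(child)
                next_frontier.append(child)
        if not next_frontier:            # every path terminated early
            break
        # Prune low-value paths to bound per-update compute.
        next_frontier.sort(key=lambda n: n.logprob, reverse=True)
        frontier = next_frontier[:max_width]
    return root


def segment_advantage(seg_reward, sibling_rewards, all_rewards, alpha=0.5):
    """Blend a tree-local baseline (siblings sharing the same prefix) with a
    global baseline over all sampled trajectories, as a stand-in for the
    tree-based segment-level advantage estimation described in the abstract."""
    local = seg_reward - sum(sibling_rewards) / max(len(sibling_rewards), 1)
    global_ = seg_reward - sum(all_rewards) / max(len(all_rewards), 1)
    return alpha * local + (1 - alpha) * global_
```

The key design point this sketch tries to capture is that branching decisions and pruning operate on fixed-length segments rather than whole trajectories, so sibling branches reuse the decoded prefix (and its KV cache) and compute is spent only where the model is locally uncertain.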