

TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

August 24, 2025
Authors: Yizhi Li, Qingshui Gu, Zhoufutu Wen, Ziniu Li, Tianshun Xing, Shuyue Guo, Tianyu Zheng, Xin Zhou, Xingwei Qu, Wangchunshu Zhou, Zheng Zhang, Wei Shen, Qian Liu, Chenghua Lin, Jian Yang, Ge Zhang, Wenhao Huang
cs.AI

Abstract

Recent advancements in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, a self-guided rollout algorithm that views sequence generation as a tree-structured searching process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to decide when additional branches are warranted. By amortizing computation across common prefixes and pruning low-value paths early, TreePO substantially reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that accounts for both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and fallback strategies. We empirically validate TreePO's performance gains on a set of reasoning benchmarks and show that its sampling design saves 22% to 43% of GPU hours for trained models, while reducing sampling compute by up to 40% at the trajectory level and 35% at the token level for existing models. While offering a free lunch of inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project page is at https://m-a-p.ai/TreePO.
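To make the rollout scheme concrete, below is a minimal, illustrative Python sketch of a segment-wise tree rollout in the spirit the abstract describes: generation proceeds in fixed-length segments, shared prefixes are amortized across children, and extra branches are spawned only where local uncertainty is high. This is not the authors' implementation; the segment length, the entropy-based branching threshold, the branch count, and the toy `sample_segment` stub are all assumptions made purely for demonstration.

```python
# Illustrative sketch of a TreePO-style segment-wise tree rollout.
# NOT the authors' implementation: SEGMENT_LEN, BRANCH_ENTROPY, MAX_BRANCHES,
# and the toy sample_segment stub are assumed values for demonstration only.
import random
from dataclasses import dataclass, field

SEGMENT_LEN = 8          # fixed-length segment decoding (assumed value)
MAX_SEGMENTS = 4         # rollout depth budget (assumed value)
BRANCH_ENTROPY = 1.5     # branch when local uncertainty exceeds this (assumed)
MAX_BRANCHES = 2         # children spawned at an uncertain node (assumed)

@dataclass
class Node:
    prefix: list           # tokens shared with ancestors (amortized prefix)
    segment: list          # tokens decoded for this node's segment
    entropy: float         # mean per-token entropy of the segment
    done: bool = False     # early stop: end-of-sequence was sampled
    children: list = field(default_factory=list)

def sample_segment(prefix):
    """Toy stand-in for the policy model: returns (tokens, mean_entropy, done)."""
    tokens = [random.randrange(100) for _ in range(SEGMENT_LEN)]
    entropy = random.uniform(0.5, 2.5)   # pretend per-token entropy
    done = random.random() < 0.15        # pretend an EOS token was drawn
    return tokens, entropy, done

def rollout_tree(prompt):
    """Grow a tree of fixed-length segments, branching on high local uncertainty."""
    root = Node(prefix=list(prompt), segment=[], entropy=0.0)
    frontier = [root]
    for _ in range(MAX_SEGMENTS):
        next_frontier = []
        for node in frontier:
            if node.done:
                continue
            # Spawn extra branches only where the local distribution is uncertain;
            # confident nodes continue a single path (low-value splits are avoided).
            n_children = MAX_BRANCHES if node.entropy > BRANCH_ENTROPY else 1
            for _ in range(n_children):
                tokens, ent, done = sample_segment(node.prefix + node.segment)
                child = Node(prefix=node.prefix + node.segment,
                             segment=tokens, entropy=ent, done=done)
                node.children.append(child)
                next_frontier.append(child)
        frontier = next_frontier
    return root

def collect_trajectories(node, acc=None):
    """Flatten the tree into complete token trajectories (one per leaf)."""
    acc = [] if acc is None else acc
    if not node.children:
        acc.append(node.prefix + node.segment)
        return acc
    for child in node.children:
        collect_trajectories(child, acc)
    return acc

if __name__ == "__main__":
    tree = rollout_tree(prompt=[1, 2, 3])
    trajectories = collect_trajectories(tree)
    print(f"sampled {len(trajectories)} trajectories sharing a common prefix")
```

In this sketch, the segment-level advantage estimation and the probability/quality-driven divergence and fallback strategies from contributions (2) and (3) are omitted; the point is only how fixed-length segments, prefix sharing, and uncertainty-gated branching fit together in a single rollout loop.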