TreePO: Bridging the Gap of Policy Optimization and Efficacy and Inference Efficiency with Heuristic Tree-based Modeling

Abstract

Recent advances in aligning large language models via reinforcement learning have achieved remarkable gains in solving complex reasoning problems, but at the cost of expensive on-policy rollouts and limited exploration of diverse reasoning paths. In this work, we introduce TreePO, which uses a self-guided rollout algorithm that treats sequence generation as a tree-structured search process. Composed of a dynamic tree sampling policy and fixed-length segment decoding, TreePO leverages local uncertainty to warrant additional branches. By amortizing computation across common prefixes and pruning low-value paths early, TreePO reduces the per-update compute burden while preserving or enhancing exploration diversity. Key contributions include: (1) a segment-wise sampling algorithm that alleviates the KV cache burden through contiguous segments and spawns new branches along with an early-stop mechanism; (2) a tree-based segment-level advantage estimation that considers both global and local proximal policy optimization; and (3) an analysis of the effectiveness of probability- and quality-driven dynamic divergence and the fallback strategy. We empirically validate the performance gain of TreePO on a set of reasoning benchmarks, along with GPU-hour savings of 22% to 43% from the sampling design for the trained models, while showing up to a 40% reduction in trajectory-level and a 35% reduction in token-level sampling compute for existing models. While offering a "free lunch" in inference efficiency, TreePO reveals a practical path toward scaling RL-based post-training with fewer samples and less compute. The project home page is at https://m-a-p.ai/TreePO.
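
The two ideas named in the abstract, amortizing computation across common prefixes and estimating advantages at the segment level, can be illustrated with a toy example. The Python sketch below is not the TreePO implementation. The `Node` structure, the helper names, and the advantage definitions (global: a segment's mean leaf reward minus the root's; local: minus its parent's) are assumptions chosen for illustration, and the abstract does not specify the actual formulas.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class Node:
    """One decoded segment in a toy rollout tree (hypothetical structure)."""
    name: str
    reward: Optional[float] = None  # only leaves carry a final reward
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the rewards of all trajectories passing through this segment."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def value(node: Node) -> float:
    """Mean reward of the trajectories that share this segment as a prefix."""
    rewards = leaf_rewards(node)
    return sum(rewards) / len(rewards)


def segment_advantages(root: Node) -> Dict[str, Tuple[float, float]]:
    """Map segment name -> (global advantage, local advantage).

    Assumed definitions: global compares against the root's value,
    local compares against the parent's value.
    """
    root_value = value(root)
    out: Dict[str, Tuple[float, float]] = {}
    stack = [(root, root_value)]
    while stack:
        parent, parent_value = stack.pop()
        for child in parent.children:
            v = value(child)
            out[child.name] = (v - root_value, v - parent_value)
            stack.append((child, v))
    return out


def segments_decoded(root: Node) -> Tuple[int, int]:
    """Return (tree cost, independent cost), counted in decoded segments."""
    def walk(node: Node, depth: int) -> Tuple[int, int]:
        if not node.children:
            return 0, depth  # an independent rollout re-decodes its whole path
        tree_cost, independent_cost = 0, 0
        for child in node.children:
            t, i = walk(child, depth + 1)
            tree_cost += 1 + t  # each shared segment is decoded once
            independent_cost += i
        return tree_cost, independent_cost
    return walk(root, 0)


def build_example() -> Node:
    """Toy tree: two first segments, each branching into two leaves."""
    a = Node("A", children=[Node("A1", reward=1.0), Node("A2", reward=0.0)])
    b = Node("B", children=[Node("B1", reward=0.0), Node("B2", reward=0.0)])
    return Node("root", children=[a, b])
```

In this toy tree, four trajectories of two segments each cost 6 decoded segments when prefixes are shared, against 8 when each trajectory is sampled independently. Segment `A`, the only one leading to a rewarded leaf, receives a positive advantage under both assumed definitions, while `B` receives a negative one.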
