

Bottom-up Policy Optimization: Your Language Model Policy Secretly Contains Internal Policies

December 22, 2025
作者: Yuqiao Tan, Minzheng Wang, Shizhu He, Huanxuan Liao, Chengfeng Zhao, Qiunan Lu, Tian Liang, Jun Zhao, Kang Liu
cs.AI

Abstract

Existing reinforcement learning (RL) approaches treat large language models (LLMs) as a single unified policy, overlooking their internal mechanisms. Understanding how the policy evolves across layers and modules is therefore crucial for enabling more targeted optimization and unraveling complex reasoning mechanisms. In this paper, we decompose the language model policy by leveraging the intrinsic split of the Transformer residual stream and the equivalence between the composition of hidden states with the unembedding matrix and the resulting samplable policy. This decomposition reveals Internal Layer Policies, corresponding to the contributions of individual layers, and Internal Modular Policies, which align with the self-attention and feed-forward network (FFN) components within each layer. By analyzing the entropy of these internal policies, we find that: (a) bottom layers maintain high entropy for exploration while top layers converge to near-zero entropy for refinement, with convergence patterns varying across model series; and (b) Llama's prediction space converges rapidly in the final layer, whereas Qwen-series models, especially Qwen3, exhibit a more human-like, progressively structured reasoning pattern. Motivated by these findings, we propose Bottom-up Policy Optimization (BuPO), a novel RL paradigm that directly optimizes the internal layer policies during early training. By aligning the training objective at lower layers, BuPO reconstructs foundational reasoning capabilities and achieves superior performance. Extensive experiments on complex reasoning benchmarks demonstrate the effectiveness of our method. Our code is available at https://github.com/Trae1ounG/BuPO.
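
The decomposition described above amounts to projecting each layer's residual-stream state through the unembedding matrix to obtain a per-layer, samplable next-token distribution whose entropy can then be tracked. The snippet below is a minimal logit-lens-style sketch of that entropy analysis, assuming a Hugging Face causal LM; the checkpoint name, the reuse of the final norm for intermediate states, and the attribute paths are illustrative assumptions, not the authors' implementation.

```python
# Sketch: per-layer "internal policy" entropy via the unembedding matrix.
# Assumes a Llama/Qwen-style causal LM from Hugging Face transformers; the
# checkpoint and attribute names below are illustrative, not from the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B"  # any similar causal LM should work
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

inputs = tok("The derivative of x^2 is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

unembed = model.get_output_embeddings().weight  # (vocab_size, hidden_dim)
final_norm = model.model.norm                   # final RMSNorm, reused per layer

for layer_idx, h in enumerate(out.hidden_states):
    # Project the last-position residual state into a vocabulary distribution.
    logits = final_norm(h[:, -1, :]) @ unembed.T
    probs = torch.softmax(logits.float(), dim=-1)
    entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=-1).item()
    print(f"layer {layer_idx:2d}  entropy = {entropy:.3f}")
```

Under the abstract's findings, such a trace would show high entropy at bottom layers and near-zero entropy at the top, with the shape of the decay differing between Llama and Qwen-series models.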