
Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

October 6, 2025
作者: Chenghao Yang, Lin Gui, Chenxiao Yang, Victor Veitch, Lizhu Zhang, Zhuokai Zhao
cs.AI

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must navigate two fundamental challenges: it must preserve sample quality while also ensuring training stability. While standard fixed-temperature sampling is simple, it struggles to balance these competing demands, as high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
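
To make the "explore-at-the-beginning, exploit-at-the-end" schedule concrete, below is a minimal sketch of annealed-temperature decoding. The linear schedule, the endpoint temperatures (`t_start`, `t_end`), and the `next_token_logits` model interface are illustrative assumptions for this sketch, not details taken from the paper.

```python
# Sketch of annealed-temperature sampling in the spirit of EAD.
# Assumptions: linear schedule, example temperatures, and a placeholder
# `next_token_logits(ids) -> logits` interface standing in for an LLM.

import numpy as np


def annealed_temperature(step: int, max_steps: int,
                         t_start: float = 1.2, t_end: float = 0.6) -> float:
    """Linearly anneal the sampling temperature from t_start down to t_end."""
    frac = min(step / max(max_steps - 1, 1), 1.0)
    return t_start + frac * (t_end - t_start)


def sample_with_annealing(next_token_logits, prompt_ids,
                          max_new_tokens=128, eos_id=None, rng=None):
    """Generate tokens, lowering the temperature as the sequence grows.

    `next_token_logits(ids)` must return unnormalized logits over the
    vocabulary for the next token given the token ids generated so far.
    """
    rng = rng or np.random.default_rng()
    ids = list(prompt_ids)
    for step in range(max_new_tokens):
        logits = np.asarray(next_token_logits(ids), dtype=np.float64)
        temp = annealed_temperature(step, max_new_tokens)
        # Temperature scaling: high temperature early (exploration),
        # low temperature late (exploitation / staying near the policy).
        scaled = logits / temp
        probs = np.exp(scaled - scaled.max())
        probs /= probs.sum()
        token = int(rng.choice(len(probs), p=probs))
        ids.append(token)
        if eos_id is not None and token == eos_id:
            break
    return ids


# Toy usage with a stand-in "model" over a 10-token vocabulary.
toy_model = lambda ids: np.random.default_rng(len(ids)).normal(size=10)
print(sample_with_annealing(toy_model, prompt_ids=[0], max_new_tokens=20))
```

Because only the sampling temperature is scheduled, a sketch like this can wrap any existing decoding loop without changing the model or the RLVR training algorithm itself.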