Let it Calm: Exploratory Annealed Decoding for Verifiable Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs), yet its success hinges on effective exploration. An ideal exploration strategy must meet two fundamental challenges: preserving sample quality and ensuring training stability. Standard fixed-temperature sampling is simple, but it struggles to balance these competing demands, because high temperatures degrade sample quality and low temperatures limit discovery. In this work, we propose a simpler and more effective strategy, Exploratory Annealed Decoding (EAD), grounded in the insight that exploration is most impactful on early tokens, which define a sequence's semantic direction. EAD implements an intuitive **explore-at-the-beginning, exploit-at-the-end** strategy by annealing the sampling temperature from high to low during generation. This dynamic schedule encourages meaningful, high-level diversity at the start, then gradually lowers the temperature to preserve sample quality and keep the sampling distribution close to the target policy, which is essential for stable training. We demonstrate that EAD is a lightweight, plug-and-play method that significantly improves sample efficiency, consistently outperforming fixed-temperature sampling across various RLVR algorithms and model sizes. Our work suggests that aligning exploration with the natural dynamics of sequential generation offers a robust path to improving LLM reasoning.
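As a rough illustration of the explore-at-the-beginning, exploit-at-the-end idea, the sketch below cools the sampling temperature as the token position advances. The abstract does not specify the schedule, so the linear decay, the default values `t_start=1.2` and `t_end=0.1`, and the helper names `annealed_temperature`, `sample_token`, `generate`, and `logits_fn` are all assumptions made for this example, not the authors' implementation.

```python
import math
import random


def annealed_temperature(step, max_steps, t_start=1.2, t_end=0.1):
    """Linearly interpolate from t_start (first token) to t_end (last token).

    The linear shape and the default values are illustrative assumptions,
    not the schedule used in the paper.
    """
    if max_steps <= 1:
        return t_end
    frac = min(step, max_steps - 1) / (max_steps - 1)
    return t_start + frac * (t_end - t_start)


def sample_token(logits, temperature, rng):
    """Sample an index from softmax(logits / temperature)."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(x - m) for x in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


def generate(logits_fn, max_steps, rng, t_start=1.2, t_end=0.1):
    """Decode max_steps tokens, lowering the temperature at every position.

    logits_fn is a hypothetical stand-in for a language model: it maps the
    token prefix generated so far to a list of next-token logits.
    """
    tokens = []
    for step in range(max_steps):
        temperature = annealed_temperature(step, max_steps, t_start, t_end)
        tokens.append(sample_token(logits_fn(tokens), temperature, rng))
    return tokens
```

Early positions are sampled from a flatter distribution, so rollouts diverge in their opening tokens, while later positions are sampled nearly greedily. In an RLVR loop, a routine like `generate` would stand in for the fixed-temperature sampler used to collect rollouts.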
