

Optimizing Chain-of-Thought Reasoners via Gradient Variance Minimization in Rejection Sampling and RL

May 5, 2025
Authors: Jiarui Yao, Yifan Hao, Hanning Zhang, Hanze Dong, Wei Xiong, Nan Jiang, Tong Zhang
cs.AI

Abstract

Chain-of-thought (CoT) reasoning in large language models (LLMs) can be formalized as a latent variable problem, where the model needs to generate intermediate reasoning steps. While prior approaches such as iterative reward-ranked fine-tuning (RAFT) have relied on such formulations, they typically apply uniform inference budgets across prompts, which fails to account for variability in difficulty and convergence behavior. This work identifies the main bottleneck in CoT training as inefficient stochastic gradient estimation due to static sampling strategies. We propose GVM-RAFT, a prompt-specific Dynamic Sample Allocation Strategy designed to minimize stochastic gradient variance under a computational budget constraint. The method dynamically allocates computational resources by monitoring prompt acceptance rates and stochastic gradient norms, ensuring that the resulting gradient variance is minimized. Our theoretical analysis shows that the proposed dynamic sampling strategy leads to accelerated convergence guarantees under suitable conditions. Experiments on mathematical reasoning show that GVM-RAFT achieves a 2-4x speedup and considerable accuracy improvements over vanilla RAFT. The proposed dynamic sampling strategy is general and can be incorporated into other reinforcement learning algorithms, such as GRPO, leading to similar improvements in convergence and test accuracy. Our code is available at https://github.com/RLHFlow/GVM.
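To make the budget-constrained allocation idea concrete, here is a minimal sketch, not the paper's implementation: the function name `allocate_samples` and the specific rule of giving each prompt a share of rollouts proportional to its gradient norm divided by the square root of its acceptance rate are assumptions chosen to illustrate the general shape of a variance-minimizing allocation; the exact rule used by GVM-RAFT is defined in the paper and the linked repository.

```python
import numpy as np

def allocate_samples(grad_norms, accept_rates, total_budget, min_samples=1):
    """Hypothetical per-prompt rollout allocation (illustrative only).

    Heuristic: to reduce total gradient variance under a fixed rollout
    budget, give prompt i a share of samples proportional to
    grad_norm_i / sqrt(accept_rate_i). The actual GVM-RAFT rule may differ.
    """
    grad_norms = np.asarray(grad_norms, dtype=float)
    accept_rates = np.clip(np.asarray(accept_rates, dtype=float), 1e-6, 1.0)

    scores = grad_norms / np.sqrt(accept_rates)
    if scores.sum() == 0:
        scores = np.ones_like(scores)  # fall back to uniform allocation

    raw = total_budget * scores / scores.sum()
    # Enforcing min_samples can slightly exceed the budget for tiny budgets.
    n = np.maximum(min_samples, np.floor(raw)).astype(int)

    # Hand any leftover budget to the highest-score prompts.
    leftover = total_budget - n.sum()
    if leftover > 0:
        for idx in np.argsort(-scores)[:leftover]:
            n[idx] += 1
    return n

# Example: the hard prompt (low acceptance, large gradient) gets most rollouts.
print(allocate_samples(grad_norms=[0.2, 1.0, 0.5],
                       accept_rates=[0.9, 0.1, 0.4],
                       total_budget=32))
```

Under this illustrative rule, prompts that are both hard to solve (low acceptance rate in rejection sampling) and influential on the update (large stochastic gradient norm) receive more of the fixed sampling budget, which is the intuition behind the dynamic allocation described in the abstract.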
