SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization
November 9, 2025
Authors: Zhi Zheng, Wee Sun Lee
cs.AI
Abstract
The soft-thinking paradigm for Large Language Model (LLM) reasoning can
outperform the conventional discrete-token Chain-of-Thought (CoT) reasoning in
some scenarios, underscoring its research and application value. However, while
the discrete-token CoT reasoning pattern can be reinforced through policy
optimization algorithms such as group relative policy optimization (GRPO),
extending the soft-thinking pattern with Reinforcement Learning (RL) remains
challenging. This difficulty stems from the complexities of injecting
stochasticity into soft-thinking tokens and updating soft-thinking policies
accordingly. As a result, previous attempts to combine soft-thinking with GRPO
typically underperform their discrete-token GRPO counterparts. To fully unlock
the potential of soft-thinking, this paper presents a novel policy optimization
algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning
pattern. SofT-GRPO injects Gumbel noise into the logits, employs the
Gumbel-Softmax technique to keep soft-thinking tokens within the pre-trained
embedding space, and leverages the reparameterization trick in the policy gradient.
We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and
results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly
outperform discrete-token GRPO on Pass@1 (+0.13% in average accuracy), while
exhibiting a substantial uplift on Pass@32 (+2.19% in average accuracy). Code
and weights are available at https://github.com/zz1358m/SofT-GRPO-master
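
A minimal sketch, assuming PyTorch, of the Gumbel-Softmax reparameterization the abstract describes: Gumbel noise perturbs the logits, a temperature-scaled softmax yields mixing weights, and the soft-thinking token is the weighted mixture of pretrained token embeddings, so it stays within the embedding space while remaining differentiable for policy-gradient updates. Function and variable names and the temperature value are illustrative assumptions, not the authors' released implementation.

```python
import torch
import torch.nn.functional as F

def gumbel_softmax_soft_token(logits: torch.Tensor,
                              embedding_matrix: torch.Tensor,
                              temperature: float = 1.0) -> torch.Tensor:
    """Illustrative sketch: logits (vocab_size,), embedding_matrix (vocab_size, hidden_dim)."""
    # Sample Gumbel(0, 1) noise and add it to the logits (reparameterized stochasticity).
    uniform = torch.rand_like(logits)
    gumbel_noise = -torch.log(-torch.log(uniform + 1e-20) + 1e-20)
    # Temperature-scaled softmax over the perturbed logits gives differentiable mixing weights.
    weights = F.softmax((logits + gumbel_noise) / temperature, dim=-1)
    # Soft-thinking token: a convex combination of pretrained embeddings,
    # which keeps it inside the pre-trained embedding space.
    return weights @ embedding_matrix
```

Because the noise enters through a fixed Gumbel(0, 1) sample rather than a discrete draw, gradients flow from the reward signal back to the logits, which is what allows a GRPO-style policy update over soft-thinking steps.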