SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Abstract

The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. This difficulty stems from the complexity of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into the logits, employs the Gumbel-Softmax technique to avoid soft-thinking tokens outside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and the results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master
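To make the sampling step concrete, the following is a minimal sketch of one generic reading of Gumbel-Softmax applied to soft-thinking tokens. It is not taken from the released code. The function names, the temperature `tau`, and the toy logits and embeddings are all assumptions introduced for illustration. The sketch adds standard Gumbel noise to the logits, applies a temperature-scaled softmax, and forms the soft token as the resulting weighted average of token embeddings. Because the weights are non-negative and sum to one, the soft token stays inside the convex hull of the embeddings. Because the noise is drawn independently of the logits, the output is a deterministic function of the logits once the noise is fixed, which is the property the reparameterization trick relies on.

```python
import math
import random


def sample_gumbel(rng):
    """Draw one standard Gumbel(0, 1) sample via inverse transform sampling."""
    u = rng.random()
    # Clamp away from 0 and 1 so both logarithms stay finite.
    u = min(max(u, 1e-12), 1.0 - 1e-12)
    return -math.log(-math.log(u))


def gumbel_softmax(logits, tau, rng):
    """Return softmax((logits + Gumbel noise) / tau) as a list of weights.

    tau is a hypothetical temperature parameter chosen for this sketch.
    """
    noisy = [(logit + sample_gumbel(rng)) / tau for logit in logits]
    peak = max(noisy)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(weights, embeddings):
    """Mix token embeddings with the given weights (a convex combination)."""
    dim = len(embeddings[0])
    return [
        sum(w * emb[d] for w, emb in zip(weights, embeddings))
        for d in range(dim)
    ]


# Toy vocabulary of three tokens with 2-dimensional embeddings (illustrative values).
rng = random.Random(0)
logits = [2.0, 1.0, 0.1]
embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

weights = gumbel_softmax(logits, tau=0.5, rng=rng)
token = soft_token(weights, embeddings)
```

A lower `tau` pushes the weights toward a one-hot vector, so the soft token approaches a single discrete token embedding, while a higher `tau` spreads the weight across more tokens. In an autodiff framework the same computation would let gradients flow from the soft token back to the logits, since the noise is treated as a constant.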
