SofT-GRPO: Surpassing Discrete-Token LLM Reinforcement Learning via Gumbel-Reparameterized Soft-Thinking Policy Optimization

Abstract

The soft-thinking paradigm for Large Language Model (LLM) reasoning can outperform conventional discrete-token Chain-of-Thought (CoT) reasoning in some scenarios, underscoring its research and application value. However, while the discrete-token CoT reasoning pattern can be reinforced through policy optimization algorithms such as group relative policy optimization (GRPO), extending the soft-thinking pattern with Reinforcement Learning (RL) remains challenging. The difficulty stems from the complexity of injecting stochasticity into soft-thinking tokens and updating soft-thinking policies accordingly. As a result, previous attempts to combine soft-thinking with GRPO typically underperform their discrete-token GRPO counterparts. To fully unlock the potential of soft-thinking, this paper presents a novel policy optimization algorithm, SofT-GRPO, to reinforce LLMs under the soft-thinking reasoning pattern. SofT-GRPO injects Gumbel noise into logits, employs the Gumbel-Softmax technique to keep soft-thinking tokens inside the pre-trained embedding space, and leverages the reparameterization trick in the policy gradient. We conduct experiments across base LLMs ranging from 1.5B to 7B parameters, and the results demonstrate that SofT-GRPO enables soft-thinking LLMs to slightly outperform discrete-token GRPO on Pass@1 (+0.13% average accuracy), while exhibiting a substantial uplift on Pass@32 (+2.19% average accuracy). Code and weights are available at https://github.com/zz1358m/SofT-GRPO-master.
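The Gumbel-Softmax step named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the three-token toy vocabulary, the two-dimensional embedding table, and the temperature value are all assumptions chosen for illustration. It shows Gumbel noise being added to logits, a temperature-scaled softmax over the perturbed logits, and the resulting weights being used to mix embedding rows into one soft token.

```python
import math
import random


def gumbel_softmax(logits, temperature, rng):
    """Return softmax((logits + Gumbel noise) / temperature) as a list of weights."""
    perturbed = []
    for logit in logits:
        # Gumbel(0, 1) sample via -log(-log(U)); clamp U away from 0 to avoid log(0).
        u = max(rng.random(), 1e-12)
        gumbel = -math.log(-math.log(u))
        perturbed.append((logit + gumbel) / temperature)
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(perturbed)
    exps = [math.exp(p - peak) for p in perturbed]
    total = sum(exps)
    return [e / total for e in exps]


def soft_token(weights, embedding_table):
    """Mix embedding rows with the given weights (a convex combination)."""
    dim = len(embedding_table[0])
    return [
        sum(w * row[d] for w, row in zip(weights, embedding_table))
        for d in range(dim)
    ]


# Hypothetical toy inputs: 3 vocabulary entries, 2-dimensional embeddings.
rng = random.Random(0)
logits = [2.0, 0.5, -1.0]
embedding_table = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

weights = gumbel_softmax(logits, temperature=0.5, rng=rng)
token = soft_token(weights, embedding_table)
```

Because the weights are non-negative and sum to one, the mixed vector stays within the convex hull of the embedding rows, which is one plausible reading of keeping soft tokens inside the pre-trained embedding space. The noise is sampled independently of the logits, so the output is a differentiable function of them; that is the property the reparameterization trick relies on.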
