SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization
May 18, 2025
Authors: Minghan Chen, Guikun Chen, Wenguan Wang, Yi Yang
cs.AI
Abstract
Large language models (LLMs) exhibit varying levels of confidence across
input prompts (questions): some lead to consistent, semantically similar
answers, while others yield diverse or contradictory outputs. This variation
reflects the LLM's uncertainty about the input prompt, a signal of how confidently
the model understands a given problem. However, vanilla Group Relative Policy
Optimization (GRPO) treats all prompts equally during policy updates, ignoring
this important information about the model's knowledge boundaries. To address
this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which
explicitly measures LLMs' uncertainty about input prompts via semantic entropy.
Semantic entropy measures the diversity of meaning in multiple generated
answers given a prompt; SEED-GRPO uses this measure to modulate the magnitude of policy
updates. This uncertainty-aware training mechanism enables dynamic adjustment
of policy update magnitudes based on question uncertainty. It allows more
conservative updates on high-uncertainty questions while maintaining the
original learning signal on confident ones. Experimental results on five
mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva
34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new
state-of-the-art performance in average accuracy, validating the effectiveness
of uncertainty-aware policy optimization.
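
The abstract describes the mechanism only at a high level. Below is a minimal sketch of how semantic entropy could modulate GRPO-style advantages, assuming a hypothetical `cluster_fn` helper that groups sampled answers by meaning (for example via mutual-entailment checks), a linear damping rule `1 - H / H_max`, and a `max_entropy` normalizer; these details are illustrative assumptions, not the authors' exact formulation.

```python
import math
from collections import Counter

def semantic_entropy(answers, cluster_fn):
    """Entropy over meaning clusters of sampled answers for one prompt.

    cluster_fn maps each answer to a semantic-cluster id (hypothetical
    helper; e.g., answers that entail each other share a cluster).
    """
    counts = Counter(cluster_fn(a) for a in answers)
    n = len(answers)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def seed_grpo_advantages(rewards, answers, cluster_fn, max_entropy):
    """Group-relative advantages scaled by prompt-level certainty.

    High semantic entropy (uncertain prompt) shrinks the update
    magnitude; low entropy leaves advantages close to vanilla GRPO.
    """
    n = len(rewards)
    mean_r = sum(rewards) / n
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / n) ** 0.5 + 1e-8
    advantages = [(r - mean_r) / std_r for r in rewards]

    entropy = semantic_entropy(answers, cluster_fn)
    scale = max(0.0, 1.0 - entropy / max_entropy)  # assumed linear damping
    return [a * scale for a in advantages]

# Toy usage: cluster answers by their final token as a stand-in for meaning.
rewards = [1.0, 1.0, 0.0, 0.0]
answers = ["x = 4", "so the answer is 4", "x = 7", "x = 9"]
adv = seed_grpo_advantages(rewards, answers,
                           cluster_fn=lambda s: s.split()[-1],
                           max_entropy=math.log(len(answers)))
```

The per-group reward normalization mirrors standard GRPO; only the entropy-based scaling of the advantages reflects the uncertainty-aware modulation described in the abstract, and its exact form here is an assumption.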