SEED-GRPO: Semantic Entropy Enhanced GRPO for Uncertainty-Aware Policy Optimization
May 18, 2025
Authors: Minghan Chen, Guikun Chen, Wenguan Wang, Yi Yang
cs.AI
Abstract
Large language models (LLMs) exhibit varying levels of confidence across
input prompts (questions): some lead to consistent, semantically similar
answers, while others yield diverse or contradictory outputs. This variation
reflects the LLM's uncertainty about the input prompt, a signal of how
confidently the model understands a given problem. However, vanilla Group
Relative Policy
Optimization (GRPO) treats all prompts equally during policy updates, ignoring
this important information about the model's knowledge boundaries. To address
this limitation, we propose SEED-GRPO (Semantic Entropy EnhanceD GRPO), which
explicitly measures LLMs' uncertainty about input prompts via semantic entropy.
Semantic entropy quantifies the diversity of meaning across multiple answers
generated for a prompt, and SEED-GRPO uses this signal to modulate the
magnitude of policy updates. This uncertainty-aware training mechanism enables
dynamic adjustment
of policy update magnitudes based on question uncertainty. It allows more
conservative updates on high-uncertainty questions while maintaining the
original learning signal on confident ones. Experimental results on five
mathematical reasoning benchmarks (AIME24 56.7, AMC 68.7, MATH 83.4, Minerva
34.2, and OlympiadBench 48.0) demonstrate that SEED-GRPO achieves new
state-of-the-art performance in average accuracy, validating the effectiveness
of uncertainty-aware policy optimization.
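A minimal Python sketch of the idea described in the abstract, under two stated assumptions: semantically equivalent answers are clustered by comparing final-answer strings (the paper may use a more sophisticated equivalence check), and the uncertainty-aware update is realized as a linear entropy-based scaling of the group-relative advantage (an illustrative rule, not necessarily the paper's exact formulation). The function names semantic_entropy and seed_grpo_advantages are hypothetical.

    import math
    from collections import Counter

    def semantic_entropy(answers):
        """Estimate semantic entropy over a group of sampled answers.

        Assumption: semantically equivalent answers share a cluster key
        (here, simply the answer string itself).
        """
        counts = Counter(answers)
        n = len(answers)
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    def seed_grpo_advantages(rewards, answers, alpha=1.0):
        """Group-relative advantages attenuated by semantic entropy.

        Standard GRPO advantage: (r_i - mean(r)) / std(r), computed within
        the group of samples for one prompt. The scaling rule below is an
        illustrative assumption: advantages shrink toward zero as the
        group's semantic entropy approaches its maximum, yielding
        conservative updates on high-uncertainty prompts while keeping
        the original signal on confident ones.
        """
        n = len(rewards)
        mean_r = sum(rewards) / n
        std_r = (sum((r - mean_r) ** 2 for r in rewards) / n) ** 0.5 or 1.0
        base = [(r - mean_r) / std_r for r in rewards]

        h = semantic_entropy(answers)
        h_max = math.log(n) if n > 1 else 1.0  # maximum entropy for n samples
        scale = max(0.0, 1.0 - alpha * h / h_max)  # 1.0 when fully confident
        return [a * scale for a in base]

    # Example: four sampled answers to one prompt, binary correctness rewards.
    answers = ["42", "42", "41", "42"]
    rewards = [1.0, 1.0, 0.0, 1.0]
    print(seed_grpo_advantages(rewards, answers))

In this example the group is mostly but not fully consistent, so the entropy-based factor (about 0.59 here) shrinks every advantage, giving a more conservative update than vanilla GRPO would; a fully consistent group would keep its advantages unchanged.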