
SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

December 2, 2025
Authors: Yixuan Tang, Yi Yang
cs.AI

Abstract

Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
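The "ratio of total variance to dominant-direction variance" described in the abstract corresponds to the stable rank of the (centered) hidden-state matrix: the sum of squared singular values divided by the largest squared singular value. The sketch below is a minimal illustration of how such a score could be computed and used for Best-of-N selection and GRPO-style group-relative advantages; the layer choice, mean-centering, scoring over prompt plus response, and the helper names (`score_response`, `best_of_n`, `group_relative_advantages`) are assumptions for illustration, not the authors' released implementation.

```python
# Minimal sketch (not the paper's code): score candidate responses by the
# stable rank of their hidden states, pick the best one (Best-of-N), and
# convert stable-rank rewards into GRPO-style group-relative advantages.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-1.5B-Instruct"  # model evaluated in the paper
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, output_hidden_states=True)
model.eval()


def stable_rank(hidden: torch.Tensor) -> float:
    """Stable rank = total variance / dominant-direction variance.

    `hidden` is a [seq_len, hidden_dim] matrix of token representations.
    Mean-centering and the layer choice are illustrative assumptions.
    """
    h = hidden - hidden.mean(dim=0, keepdim=True)   # center token representations
    s = torch.linalg.svdvals(h)                     # singular values, descending
    total_variance = (s ** 2).sum()                 # ||H||_F^2
    dominant_variance = s[0] ** 2                   # ||H||_2^2 (top direction)
    return (total_variance / dominant_variance).item()


@torch.no_grad()
def score_response(prompt: str, response: str, layer: int = -1) -> float:
    """Score a candidate by the stable rank of its hidden states (assumed setup)."""
    inputs = tokenizer(prompt + response, return_tensors="pt")
    outputs = model(**inputs)
    hidden = outputs.hidden_states[layer][0]        # [seq_len, hidden_dim]
    return stable_rank(hidden)


def best_of_n(prompt: str, candidates: list[str]) -> str:
    """Best-of-N selection: keep the candidate with the highest stable rank."""
    scores = [score_response(prompt, c) for c in candidates]
    return candidates[max(range(len(candidates)), key=scores.__getitem__)]


def group_relative_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """GRPO-style advantages: standardize stable-rank rewards within a sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In this reading, stable rank serves as a drop-in replacement for a learned reward model: higher scores indicate hidden states whose variance is spread across more directions, and the group-relative standardization is the standard GRPO advantage estimate applied to those scores.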