

Rubrics as an Attack Surface: Stealthy Preference Drift in LLM Judges

February 14, 2026
Authors: Ruomeng Ding, Yifei Pang, He Sun, Yizhong Wang, Zhiwei Steven Wu, Zhun Deng
cs.AI

Abstract

Evaluation and alignment pipelines for large language models increasingly rely on LLM-based judges, whose behavior is guided by natural-language rubrics and validated on benchmarks. We identify a previously under-recognized vulnerability in this workflow, which we term Rubric-Induced Preference Drift (RIPD). Even when rubric edits pass benchmark validation, they can still produce systematic and directional shifts in a judge's preferences on target domains. Because rubrics serve as a high-level decision interface, such drift can emerge from seemingly natural, criterion-preserving edits and remain difficult to detect through aggregate benchmark metrics or limited spot-checking. We further show that this vulnerability can be exploited through rubric-based preference attacks, in which benchmark-compliant rubric edits steer judgments away from a fixed human or trusted reference on target domains, systematically inducing RIPD and reducing target-domain accuracy by up to 9.5% (helpfulness) and 27.9% (harmlessness). When these judgments are used to generate preference labels for downstream post-training, the induced bias propagates through alignment pipelines and becomes internalized in trained policies, leading to persistent and systematic drift in model behavior. Overall, our findings highlight evaluation rubrics as a sensitive and manipulable control interface, revealing a system-level alignment risk that extends beyond evaluator reliability alone. The code is available at: https://github.com/ZDCSlab/Rubrics-as-an-Attack-Surface. Warning: certain sections may contain potentially harmful content that may not be appropriate for all readers.
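To make the measurement concrete, the sketch below shows one way RIPD could be quantified: judge a fixed set of response pairs under a baseline rubric and an edited rubric, then compare each against trusted reference labels. This is a minimal illustration, not the paper's actual evaluation harness; the `judge` callable, the `pairs` data, and the rubric strings are hypothetical placeholders.

```python
# Minimal sketch of quantifying rubric-induced preference drift (RIPD).
# Assumed (hypothetical) interfaces:
#   judge(rubric, prompt, resp_a, resp_b) -> "A" or "B"  (an LLM judge call)
#   pairs: list of (prompt, resp_a, resp_b, reference) tuples, where
#          `reference` is the trusted human label on the target domain.

def accuracy(rubric, pairs, judge):
    """Fraction of judge decisions that match the trusted reference labels."""
    hits = sum(judge(rubric, p, a, b) == ref for p, a, b, ref in pairs)
    return hits / len(pairs)

def preference_drift(base_rubric, edited_rubric, pairs, judge):
    """Compare a baseline rubric against an edited one on the same pairs:
    how often the edit flips the judge's decision, and how much accuracy
    against the fixed reference changes."""
    flips = 0
    for prompt, a, b, _ in pairs:
        if judge(base_rubric, prompt, a, b) != judge(edited_rubric, prompt, a, b):
            flips += 1
    return {
        "flip_rate": flips / len(pairs),
        # A large positive drop signals a successful preference attack,
        # e.g., up to 0.279 on harmlessness in the paper's experiments.
        "accuracy_drop": accuracy(base_rubric, pairs, judge)
                         - accuracy(edited_rubric, pairs, judge),
    }
```

The key point this sketch captures is that both rubrics can look benign and pass aggregate benchmark validation while `accuracy_drop` on a specific target domain is large, which is why spot-checking the rubric text alone does not surface the attack.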