
Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

July 24, 2025
Author: Víctor Gallego
cs.AI

Abstract

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction.
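
To make the multi-step inference process concrete, the following Python sketch implements the four SSC steps (generate, critique, revise the spec, regenerate) as plain prompt chaining. The `query_lm` helper and all prompt wording are assumptions for illustration only; the paper's actual implementation lives in the linked repository.

```python
# A minimal sketch of the SSC loop, assuming a generic `query_lm(prompt) -> str`
# helper that wraps whatever chat-LM API you use. The prompt wording below is
# illustrative and not taken from the paper.

def query_lm(prompt: str) -> str:
    """Hypothetical helper: send `prompt` to a language model, return its reply."""
    raise NotImplementedError("Wire this to your LM client of choice.")


def specification_self_correction(spec: str, task: str) -> str:
    # Step 1: generate an initial response under the (possibly tainted) spec.
    initial = query_lm(
        f"Specification:\n{spec}\n\nTask:\n{task}\n\n"
        "Write a response that follows the specification."
    )

    # Step 2: the model critiques its own output, looking for reward hacking,
    # i.e., places where the response exploits loopholes in the specification
    # rather than serving the user's true intent.
    critique = query_lm(
        f"Specification:\n{spec}\n\nResponse:\n{initial}\n\n"
        "Does the response exploit flaws or loopholes in the specification "
        "instead of fulfilling the user's true intent? Explain briefly."
    )

    # Step 3: revise the specification itself to close the identified loopholes.
    revised_spec = query_lm(
        f"Original specification:\n{spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification to remove any exploitable flaws while "
        "preserving its legitimate requirements. Output only the revised text."
    )

    # Step 4: generate the final, more robust response under the corrected spec.
    return query_lm(
        f"Specification:\n{revised_spec}\n\nTask:\n{task}\n\n"
        "Write a response that follows the specification."
    )
```

Note that the repair happens entirely at inference time: only the in-context specification is rewritten, and no model weights are touched.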