

Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

July 24, 2025
Author: Víctor Gallego
cs.AI

Abstract

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel, test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process where the model first generates a response based on a potentially tainted specification, critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code at https://github.com/vicgalle/specification-self-correction.
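The multi-step inference process described in the abstract (draft, critique, revise the specification, regenerate) can be sketched as a simple test-time loop. This is a hedged illustration, not the paper's implementation: the `query_lm` callable and the prompt wording are assumptions standing in for whatever LM client and prompts the authors actually use.

```python
def ssc(task: str, spec: str, query_lm) -> tuple[str, str]:
    """Sketch of the Specification Self-Correction (SSC) loop.

    `query_lm` is a hypothetical LM call (prompt -> completion); swap in any
    real client. Prompts below are illustrative, not the paper's.
    """
    # Step 1: draft a response under the (possibly tainted) specification.
    draft = query_lm(f"Specification:\n{spec}\n\nTask: {task}\nRespond.")

    # Step 2: ask the model to critique its own output for spec-gaming.
    critique = query_lm(
        f"Specification:\n{spec}\n\nResponse:\n{draft}\n\n"
        "Does this response exploit flaws or loopholes in the "
        "specification rather than fulfilling the user's true intent? Explain."
    )

    # Step 3: revise the specification itself to close the loophole.
    revised_spec = query_lm(
        f"Specification:\n{spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification so the identified loopholes "
        "cannot be exploited. Output only the revised specification."
    )

    # Step 4: generate the final response under the self-corrected spec.
    final = query_lm(f"Specification:\n{revised_spec}\n\nTask: {task}\nRespond.")
    return final, revised_spec
```

Note that the repair happens entirely at inference time: no weights change, and the revised specification can be reused for subsequent queries.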
PDF | July 28, 2025