Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Abstract

Language models (LMs) are susceptible to in-context reward hacking, where they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, then critiques its own output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code is available at https://github.com/vicgalle/specification-self-correction.
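
The four-step inference process described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `LM` callable interface, the `SSCResult` container, the function name, and all prompt wording are assumptions introduced here for clarity. The linked repository is the authoritative source for the actual prompts and pipeline.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface: any function that maps a prompt string to a completion.
LM = Callable[[str], str]


@dataclass
class SSCResult:
    initial_response: str
    critique: str
    revised_spec: str
    final_response: str


def specification_self_correction(lm: LM, task: str, spec: str) -> SSCResult:
    """Run the four SSC steps with a single model (prompt wording is illustrative)."""
    # Step 1: respond under the original, possibly tainted, specification.
    initial = lm(f"Specification:\n{spec}\n\nTask:\n{task}\n\nResponse:")

    # Step 2: the model critiques its own output against the specification.
    critique = lm(
        f"Specification:\n{spec}\n\nTask:\n{task}\n\nResponse:\n{initial}\n\n"
        "Critique this response. Does it exploit a flaw in the specification "
        "instead of serving the user's intent?"
    )

    # Step 3: the model rewrites the specification to remove the loophole.
    revised_spec = lm(
        f"Specification:\n{spec}\n\nCritique:\n{critique}\n\n"
        "Rewrite the specification so the exploitable loophole is removed."
    ).strip()

    # Step 4: generate the final response under the revised specification only.
    final = lm(f"Specification:\n{revised_spec}\n\nTask:\n{task}\n\nResponse:")

    return SSCResult(initial, critique, revised_spec, final)
```

Because every step is an ordinary forward call to the same model, the sketch involves no weight updates, matching the abstract's claim that the repair happens purely at inference time. In practice `lm` would wrap whatever chat or completion API is in use.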
