Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Abstract

Language models (LMs) are susceptible to in-context reward hacking, in which they exploit flaws in tainted or faulty written specifications or rubrics to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, critiques its own output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code is available at https://github.com/vicgalle/specification-self-correction.
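
The four inference steps described in the abstract (initial response, self-critique, specification revision, final response) can be sketched as a short pipeline. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LM` callable type, the `SSCResult` container, the function name, and all prompt wording are hypothetical and are not taken from the paper or its repository.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interface (an assumption): any function that maps a prompt
# string to a completion string can play the role of the language model.
LM = Callable[[str], str]


@dataclass
class SSCResult:
    """Intermediate and final outputs of one SSC pass (illustrative only)."""
    initial_response: str
    critique: str
    revised_spec: str
    final_response: str


def specification_self_correction(lm: LM, task: str, spec: str) -> SSCResult:
    """Run the four steps named in the abstract; prompt texts are assumed."""
    # Step 1: respond under the original, possibly tainted, specification.
    initial = lm(f"Specification:\n{spec}\n\nTask:\n{task}\n\nResponse:")

    # Step 2: the model critiques its own output against that specification.
    critique = lm(
        f"Specification:\n{spec}\n\nTask:\n{task}\n\n"
        f"Response:\n{initial}\n\nCritique this response:"
    )

    # Step 3: the model rewrites the specification itself to close the loophole.
    revised = lm(
        f"Specification:\n{spec}\n\nResponse:\n{initial}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the specification so it no longer rewards exploiting flaws:"
    )

    # Step 4: generate the final response under the self-corrected specification.
    final = lm(f"Specification:\n{revised}\n\nTask:\n{task}\n\nResponse:")

    return SSCResult(initial, critique, revised, final)
```

All four steps are plain model calls, so the sketch matches the abstract's claim that the repair happens at inference time without any weight modification. In practice `lm` would wrap whatever model API is in use.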
