Specification Self-Correction: Mitigating In-Context Reward Hacking Through Test-Time Refinement

Abstract

Language models (LMs) are susceptible to in-context reward hacking: they exploit flaws in a tainted or faulty written specification or rubric to achieve high scores without fulfilling the user's true intent. We introduce Specification Self-Correction (SSC), a novel test-time framework that enables an LM to identify and correct flaws within its own guiding specification. SSC employs a multi-step inference process in which the model first generates a response based on a potentially tainted specification, then critiques its output, and then revises the specification itself to remove the exploitable loophole. A final, more robust response is then generated using this self-corrected specification. Across experiments spanning creative writing and agentic coding tasks with several LMs, we demonstrate that while models initially game tainted specifications in 50-70% of cases, the SSC process reduces this vulnerability by over 90%. This dynamic repair occurs at inference time, requires no weight modification, and leads to more robustly aligned model behavior. Code is available at https://github.com/vicgalle/specification-self-correction.
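The four-step inference process described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `specification_self_correction`, the `generate` callable (any function that maps a prompt string to a model completion), the prompt wording, and the choice of what each step is conditioned on are all assumptions made for the example.

```python
from typing import Callable, Dict

# Hypothetical interface: any function that maps a prompt to a completion.
Generate = Callable[[str], str]


def specification_self_correction(generate: Generate, task: str, spec: str) -> Dict[str, str]:
    """Run the four SSC steps from the abstract with illustrative prompts."""
    # Step 1: respond under the original, possibly tainted, specification.
    initial_response = generate(
        f"Specification:\n{spec}\n\nTask:\n{task}\n\nResponse:"
    )

    # Step 2: have the model critique its own output.
    critique = generate(
        f"Specification:\n{spec}\n\nTask:\n{task}\n\n"
        f"Response:\n{initial_response}\n\n"
        "Critique this response."
    )

    # Step 3: revise the specification itself to close any loophole.
    revised_spec = generate(
        f"Specification:\n{spec}\n\nResponse:\n{initial_response}\n\n"
        f"Critique:\n{critique}\n\n"
        "Rewrite the specification so that it reflects the user's true "
        "intent and cannot be gamed. Return only the revised specification."
    )

    # Step 4: generate the final response under the revised specification.
    final_response = generate(
        f"Specification:\n{revised_spec}\n\nTask:\n{task}\n\nResponse:"
    )

    return {
        "initial_response": initial_response,
        "critique": critique,
        "revised_spec": revised_spec,
        "final_response": final_response,
    }
```

Because the sketch only issues four sequential calls to `generate`, it involves no weight updates, which matches the abstract's description of the repair happening at inference time. Passing in a thin wrapper around whichever model API is in use is enough to try it.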
