Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

Abstract

Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings: GPT-5.2 in its highest reasoning mode achieves the best detection rate on TRACE at 63%, up from 45% in the isolated setting. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and the analysis cluster size substantially affect detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their own models.
