基于对比分析的代码环境奖励攻击检测基准研究

摘要

近年來，程式碼生成領域的強化學習進展使得穩健環境對於防範獎勵破解變得至關重要。隨著大型語言模型日益充當程式碼強化學習中的評估者，其檢測獎勵破解的能力仍缺乏深入研究。本文提出一個涵蓋54個類別的新型獎勵漏洞分類法，並推出TRACE（程式碼環境中的獎勵異常測試）——一個包含517條測試軌跡、經人工驗證的合成基準數據集。有別於以往在孤立分類情境下評估獎勵破解檢測的研究，我們通過TRACE數據集對比了更貼近現實的對比式異常檢測框架。實驗結果表明，模型在對比情境下的獎勵破解檢測效果顯著優於孤立分類情境：GPT-5.2在最高推理模式下檢測率達63%，較孤立情境的45%實現大幅提升。基於此發現，我們證實頂尖模型對語義情境化獎勵破解的處理能力遠遜於對語法情境化破解的處理。我們進一步開展了模型行為的定性分析，並通過消融實驗證明良性軌跡與破解軌跡的比例及分析聚類規模會顯著影響檢測性能。現公開釋出基準數據集與評估框架，以助力學界擴展TRACE並評估相關模型。

English

Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning across 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 testing trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 with highest reasoning mode achieving the best detection rate at 63%, up from 45% in isolated settings on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks compared to syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and analysis cluster sizes substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.

基于对比分析的代码环境奖励攻击检测基准研究

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

摘要

Support