

Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

January 27, 2026
Authors: Darshan Deshpande, Anand Kannappan, Rebecca Qian
cs.AI

Abstract

Recent advances in reinforcement learning for code generation have made robust environments essential to prevent reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 test trajectories. Unlike prior work that evaluates reward hack detection in isolated classification scenarios, we contrast these evaluations with a more realistic, contrastive anomaly detection setup on TRACE. Our experiments reveal that models capture reward hacks more effectively in contrastive settings than in isolated classification settings, with GPT-5.2 at its highest reasoning mode achieving the best detection rate of 63%, up from 45% in the isolated setting on TRACE. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones. We further conduct qualitative analyses of model behaviors, as well as ablation studies showing that the ratio of benign to hacked trajectories and the analysis cluster size substantially impact detection performance. We release the benchmark and evaluation harness to enable the community to expand TRACE and evaluate their models.
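The distinction between the two evaluation setups can be sketched in a few lines. The sketch below is illustrative only and is not the TRACE harness: `Trajectory`, `judge`, and the oracle judge used in the usage example are hypothetical names we introduce here. The key structural difference is what the judge gets to see, namely one trajectory in isolation versus a whole cluster in which benign trajectories serve as a behavioral baseline.

```python
from dataclasses import dataclass
from typing import Callable, List, Set

@dataclass(frozen=True)
class Trajectory:
    trajectory_id: str
    is_hacked: bool  # gold label; hidden from the judge in a real evaluation

# A judge receives the context it is allowed to see plus the candidate
# trajectory, and returns True if it flags the candidate as a reward hack.
Judge = Callable[[List[Trajectory], Trajectory], bool]

def isolated_detection(trajectories: List[Trajectory], judge: Judge) -> Set[str]:
    """Isolated classification: the judge sees each trajectory on its own."""
    return {t.trajectory_id for t in trajectories if judge([t], t)}

def contrastive_detection(trajectories: List[Trajectory], judge: Judge) -> Set[str]:
    """Contrastive anomaly detection: the judge sees the full cluster,
    so benign trajectories provide a reference for spotting anomalies."""
    return {t.trajectory_id for t in trajectories if judge(trajectories, t)}

def detection_rate(flagged: Set[str], trajectories: List[Trajectory]) -> float:
    """Fraction of truly hacked trajectories that were flagged."""
    hacked = [t for t in trajectories if t.is_hacked]
    return sum(t.trajectory_id in flagged for t in hacked) / len(hacked)

# Usage with a toy oracle judge standing in for an LLM evaluator.
cluster = [
    Trajectory("t1", is_hacked=False),
    Trajectory("t2", is_hacked=True),
    Trajectory("t3", is_hacked=False),
    Trajectory("t4", is_hacked=True),
]

def oracle_judge(context: List[Trajectory], candidate: Trajectory) -> bool:
    # Placeholder: flags based on the gold label; a real judge would
    # inspect code diffs, test results, and reward signals instead.
    return candidate.is_hacked

flagged = contrastive_detection(cluster, oracle_judge)
rate = detection_rate(flagged, cluster)
```

In a real run the judge would be an LLM prompted with the trajectories' code and reward traces rather than the gold labels; the ablations in the paper on the benign-to-hacked ratio and the cluster size correspond to varying the composition and length of `cluster` above.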