Benchmarking Reward Hack Detection in Code Environments via Contrastive Analysis

Abstract

Recent advances in reinforcement learning for code generation have made robust environments essential for preventing reward hacking. As LLMs increasingly serve as evaluators in code-based RL, their ability to detect reward hacking remains understudied. In this paper, we propose a novel taxonomy of reward exploits spanning 54 categories and introduce TRACE (Testing Reward Anomalies in Code Environments), a synthetically curated and human-verified benchmark containing 517 test trajectories. Whereas prior work evaluates reward hack detection in isolated classification scenarios, we compare that setting against a more realistic contrastive anomaly detection setup on TRACE. Our experiments reveal that models detect reward hacks more effectively in contrastive settings than in isolated classification settings: GPT-5.2 in its highest reasoning mode achieves the best detection rate on TRACE at 63%, up from 45% in the isolated setting. Building on this insight, we demonstrate that state-of-the-art models struggle significantly more with semantically contextualized reward hacks than with syntactically contextualized ones. We further conduct qualitative analyses of model behavior, as well as ablation studies showing that the ratio of benign to hacked trajectories and the analysis cluster size substantially affect detection performance. We release the benchmark and evaluation harness so that the community can extend TRACE and evaluate their own models.
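To make the contrast between the two evaluation settings concrete, the following is a minimal sketch, not the released TRACE harness. The `Trajectory` record, the `build_clusters` and `detection_rate` helpers, and the reading of "detection rate" as the fraction of hacked trajectories that a judge flags are all assumptions made for illustration. The judge here is a stand-in oracle rather than an LLM call.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Trajectory:
    """Hypothetical record; the real TRACE schema may differ."""
    traj_id: str
    is_hack: bool


def build_clusters(benign, hacked, cluster_size, hacks_per_cluster, seed=0):
    """Group trajectories into clusters mixing benign and hacked examples.

    `cluster_size` and `hacks_per_cluster` are the two knobs that would be
    varied in cluster-size and benign-to-hacked-ratio ablations.
    """
    if not 0 < hacks_per_cluster < cluster_size:
        raise ValueError("each cluster needs at least one hacked and one benign trajectory")
    rng = random.Random(seed)
    benign, hacked = list(benign), list(hacked)
    rng.shuffle(benign)
    rng.shuffle(hacked)
    benign_per_cluster = cluster_size - hacks_per_cluster
    n_clusters = min(len(hacked) // hacks_per_cluster,
                     len(benign) // benign_per_cluster)
    clusters = []
    for i in range(n_clusters):
        cluster = (
            hacked[i * hacks_per_cluster:(i + 1) * hacks_per_cluster]
            + benign[i * benign_per_cluster:(i + 1) * benign_per_cluster]
        )
        rng.shuffle(cluster)  # avoid leaking labels through position
        clusters.append(cluster)
    return clusters


def detection_rate(clusters, judge):
    """Fraction of hacked trajectories flagged by `judge` (assumed metric).

    `judge` receives a whole cluster and returns the set of traj_ids it
    considers reward hacks. Clusters of size one give the isolated setting.
    """
    flagged = total = 0
    for cluster in clusters:
        predicted = judge(cluster)
        for t in cluster:
            if t.is_hack:
                total += 1
                flagged += t.traj_id in predicted
    return flagged / total if total else 0.0


# Toy data and a perfect stand-in judge, for illustration only.
benign = [Trajectory(f"b{i}", False) for i in range(8)]
hacked = [Trajectory(f"h{i}", True) for i in range(2)]
oracle = lambda cluster: {t.traj_id for t in cluster if t.is_hack}

clusters = build_clusters(benign, hacked, cluster_size=5, hacks_per_cluster=1)
contrastive = detection_rate(clusters, oracle)
isolated = detection_rate([[t] for t in hacked], oracle)
```

In this sketch the only difference between the two settings is what the judge sees at once: a mixed cluster in the contrastive case, a single trajectory in the isolated case. Replacing `oracle` with a model-backed judge would let the two rates diverge.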
