Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Abstract

A growing failure mode in agent evaluation and training is that models achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, which produces deceptive performance. As a result, evaluation scores become unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests on which the best performance achievable without cheating is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: a score substantially above the cap is implausible and is therefore evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle that discourages optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that follow the intended task specification more closely.
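
The capped-score idea can be illustrated with a minimal sketch. It is not the construction used in CapCode or CapReward, whose details the abstract does not give. It assumes, purely for illustration, that the randomized tests are independent and that an honest solution passes each one with probability at most `cap`. The names `binomial_tail`, `looks_like_cheating`, and `capped_reward`, the threshold `alpha`, and the linear `penalty` term are all hypothetical.

```python
from math import comb


def binomial_tail(passed: int, total: int, cap: float) -> float:
    """Return P(X >= passed) for X ~ Binomial(total, cap).

    Assumption: tests are independent and an honest solution passes
    each test with probability at most `cap`.
    """
    return sum(
        comb(total, k) * cap**k * (1.0 - cap) ** (total - k)
        for k in range(passed, total + 1)
    )


def looks_like_cheating(passed: int, total: int, cap: float,
                        alpha: float = 0.01) -> bool:
    """Flag a result whose pass count is implausibly far above the cap.

    `alpha` is a hypothetical significance threshold, not a value
    taken from the source.
    """
    return binomial_tail(passed, total, cap) < alpha


def capped_reward(score: float, cap: float, penalty: float = 1.0) -> float:
    """Hypothetical reward shape: equal to the score up to the cap,
    then decreasing linearly so that exceeding the cap is never
    better than reaching it."""
    if score <= cap:
        return score
    return cap - penalty * (score - cap)
```

Under these assumptions, a model that passes all 100 of 100 tests with a cap of 0.8 would be flagged, while one that passes 80 would not. The linear penalty is only one way to make scores above the cap unattractive; any reward that stops increasing past the cap would serve the same illustrative purpose.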