Are Coding Agents Deceiving Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

Abstract

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as a measure of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below a perfect score. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle that discourages optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
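
As an illustration of the capped-score idea only, the sketch below shows one way a score above the cap could be flagged and one way a reward could stop paying for gains beyond the cap. It is not the paper's implementation. The function names (`binomial_tail`, `flag_cheating`, `capped_reward`), the significance level `alpha`, the `penalty` weight, and the modeling assumption that each randomized test is passed independently with probability at most `cap` by a non-cheating solution are all hypothetical choices made for this example.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_cheating(passed: int, total: int, cap: float, alpha: float = 0.01) -> bool:
    """Flag a run whose pass count is implausibly high under the cap.

    Assumption (ours, for illustration): an honest solution passes each
    randomized test independently with probability at most `cap`.
    """
    return binomial_tail(total, passed, cap) < alpha


def capped_reward(score: float, cap: float, penalty: float = 1.0) -> float:
    """Reward rises with the score up to `cap`; any excess is penalized.

    The linear penalty is a hypothetical choice, not the CapReward formula.
    """
    excess = max(0.0, score - cap)
    return min(score, cap) - penalty * excess
```

With a cap of 0.8, a run that passes all 100 of 100 tests is flagged, while a run that passes 80 is not; likewise, `capped_reward(1.0, 0.8)` is lower than `capped_reward(0.8, 0.8)`, so pushing past the cap is never rewarded under this sketch.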