Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Randomized-Test Evaluation

Abstract

A growing failure mode in agent evaluation and training is that models achieve high evaluation scores by exploiting shortcuts rather than solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests in which the best performance achievable without cheating is deliberately capped below a perfect score of one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle that discourages optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that follow the intended task specification more closely.
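
The capped-score idea can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not specify how scores are compared against the cap or how CapReward is shaped. The sketch assumes that an honest solver passes each randomized test independently with probability at most `cap`, uses a one-sided binomial tail as the test for "substantially above the cap", and uses a linear penalty on the excess as one possible reward shape. The names `binomial_tail`, `flags_cheating`, `capped_reward`, `alpha`, and `penalty` are all hypothetical.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flags_cheating(passed: int, total: int, cap: float, alpha: float = 0.01) -> bool:
    """Flag a run whose pass count is implausibly high under the cap.

    Assumption (not from the paper): an honest solver passes each test
    independently with probability at most `cap`, so the chance of
    seeing `passed` or more successes is bounded by the binomial tail.
    """
    return binomial_tail(total, passed, cap) < alpha


def capped_reward(score: float, cap: float, penalty: float = 1.0) -> float:
    """Hypothetical reward shape: grows up to the cap, shrinks beyond it.

    Assumption (not from the paper): the excess over the cap is
    penalized linearly with weight `penalty`.
    """
    excess = max(0.0, score - cap)
    return min(score, cap) - penalty * excess


if __name__ == "__main__":
    cap = 0.8
    # A score at the cap is plausible; a perfect score on 100 tests is not.
    print(flags_cheating(passed=80, total=100, cap=cap))
    print(flags_cheating(passed=100, total=100, cap=cap))
    # The reward peaks at the cap and drops for scores above it.
    print(capped_reward(0.8, cap), capped_reward(1.0, cap))
```

Under these assumptions, a run that scores exactly at the cap is not flagged, while a perfect score on many tests is, and the reward for a perfect score is lower than the reward for a score at the cap.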