Do Coding Agents Deceive Us? Detecting and Preventing Cheating through Capped Evaluation with Randomized Tests

Abstract

A growing failure mode in agent evaluation and training is that models achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests in which the best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and can therefore be treated as evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle that discourages optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving the performance ranking of models, and that CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
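To make the capped-performance idea concrete, the following is a minimal illustrative sketch, not the construction used in this work. It assumes, purely for illustration, that each test is passed independently with probability at most the cap by any non-cheating solution, so that an implausibly high pass count can be flagged with a binomial tail bound. The names `binomial_tail`, `flag_cheating`, and `capped_reward`, the significance level `alpha`, and the linear penalty are all hypothetical choices; the abstract does not specify the actual detection rule or the form of CapReward.

```python
from math import comb


def binomial_tail(n: int, k: int, p: float) -> float:
    """Return P[X >= k] for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def flag_cheating(passed: int, total: int, cap: float, alpha: float = 0.01) -> bool:
    """Flag a run whose pass count is implausible under the cap.

    Assumption (hypothetical): an honest solution passes each test
    independently with probability at most `cap`.
    """
    return binomial_tail(total, passed, cap) < alpha


def capped_reward(score: float, cap: float, penalty: float = 1.0) -> float:
    """Hypothetical capped reward: increases with the score up to the cap,
    then decreases linearly for any excess above the cap."""
    excess = max(0.0, score - cap)
    return min(score, cap) - penalty * excess


if __name__ == "__main__":
    # A perfect score on 100 tests with a cap of 0.8 is flagged.
    print(flag_cheating(passed=100, total=100, cap=0.8))
    # A score right at the cap is not flagged.
    print(flag_cheating(passed=80, total=100, cap=0.8))
    # Scoring above the cap earns less reward than scoring at the cap.
    print(capped_reward(0.8, cap=0.8), capped_reward(1.0, cap=0.8))
```

Under these assumptions, a score at or below the cap is never penalized, while a score well above it is both flagged and rewarded less than an honest score at the cap, which mirrors the interpretation of above-cap scores described in the abstract.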