ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
April 5, 2026
Authors: Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li
cs.AI
Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which code candidates are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many candidates pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank candidates by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that its expected value is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and instead iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
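The leave-one-out procedure described above can be sketched directly from a binary pass matrix. The following is a minimal illustration, not the paper's ACES-C/ACES-O method: it aggregates the remaining tests by an unweighted vote sum (the paper derives learned weights instead), and the function name `loo_auc`, the uniform aggregation, and the 0.5 tie-handling convention are assumptions made for this sketch.

```python
import numpy as np

def loo_auc(pass_matrix):
    """Per-test leave-one-out AUC from a binary pass matrix.

    pass_matrix: (n_codes, n_tests) array with entry 1 if code
    candidate i passes test j, else 0.
    Returns an array of per-test LOO-AUC scores; a test that is
    passed by all candidates or by none gets NaN (uninformative).
    """
    P = np.asarray(pass_matrix, dtype=float)
    n_codes, n_tests = P.shape
    totals = P.sum(axis=1)                 # aggregate vote count per candidate
    aucs = np.full(n_tests, np.nan)
    for j in range(n_tests):
        scores = totals - P[:, j]          # rank by the *remaining* tests only
        labels = P[:, j]                   # held-out test's pass/fail pattern
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # AUC = P(passing candidate outranks failing one), ties count 0.5
        diff = pos[:, None] - neg[None, :]
        aucs[j] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return aucs
```

On a toy matrix with four candidates and four tests where three tests agree and one is inverted, the agreeing tests score an LOO-AUC of 1.0 while the inverted test scores 0.0, matching the intuition that a test's score should reflect how well it separates the candidates that the other tests rank highly.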