ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
April 5, 2026
Authors: Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li
cs.AI
Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which code candidates are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many candidates pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank candidates by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that its expected value is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and instead iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
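The leave-one-out procedure described above can be sketched directly from a binary pass matrix. The following is a minimal illustration, not the paper's ACES-C/ACES-O method: it aggregates the remaining tests by an unweighted vote sum (the paper derives learned weights instead), and the function name `loo_auc`, the uniform aggregation, and the 0.5 tie-handling convention are assumptions made for this sketch.

```python
import numpy as np

def loo_auc(pass_matrix):
    """Per-test leave-one-out AUC from a binary pass matrix.

    pass_matrix: (n_codes, n_tests) array with entry 1 if code
    candidate i passes test j, else 0.
    Returns an array of per-test LOO-AUC scores; a test that is
    passed by all candidates or by none gets NaN (uninformative).
    """
    P = np.asarray(pass_matrix, dtype=float)
    n_codes, n_tests = P.shape
    totals = P.sum(axis=1)                 # aggregate vote count per candidate
    aucs = np.full(n_tests, np.nan)
    for j in range(n_tests):
        scores = totals - P[:, j]          # rank by the *remaining* tests only
        labels = P[:, j]                   # held-out test's pass/fail pattern
        pos, neg = scores[labels == 1], scores[labels == 0]
        if len(pos) == 0 or len(neg) == 0:
            continue
        # AUC = P(passing candidate outranks failing one), ties count 0.5
        diff = pos[:, None] - neg[None, :]
        aucs[j] = (diff > 0).mean() + 0.5 * (diff == 0).mean()
    return aucs
```

On a toy matrix with four candidates and four tests where three tests agree and one is inverted, the agreeing tests score an LOO-AUC of 1.0 while the inverted test scores 0.0, matching the intuition that a test's score should reflect how well it separates the candidates that the other tests rank highly.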