ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter out unreliable tests. Yet determining whether a test is correct requires knowing which code candidates are correct, which creates a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many candidates pass a test, but whether the test can distinguish correct code from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank the candidates by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
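The leave-one-out procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `loo_auc`, the uniform (unweighted) aggregate over the remaining tests, the half-credit treatment of ties, and the fallback value of 0.5 for a test that every candidate passes or every candidate fails are all assumptions made here for illustration. The toy pass matrix is invented.

```python
def loo_auc(pass_matrix):
    """Return one leave-one-out AUC value per test.

    pass_matrix[i][j] is 1 if code candidate i passes test j, else 0.
    """
    n_codes = len(pass_matrix)
    n_tests = len(pass_matrix[0]) if n_codes else 0
    totals = [sum(row) for row in pass_matrix]
    result = []
    for j in range(n_tests):
        # Aggregate score of each candidate on every test except test j.
        # Uniform weights are an assumption of this sketch.
        scores = [totals[i] - pass_matrix[i][j] for i in range(n_codes)]
        passed = [scores[i] for i in range(n_codes) if pass_matrix[i][j] == 1]
        failed = [scores[i] for i in range(n_codes) if pass_matrix[i][j] == 0]
        if not passed or not failed:
            # The held-out test does not split the candidates, so AUC is
            # undefined; this sketch treats it as uninformative (assumption).
            result.append(0.5)
            continue
        # AUC = fraction of (passing, failing) pairs in which the passing
        # candidate has the higher aggregate score; ties count as 0.5.
        wins = 0.0
        for p in passed:
            for f in failed:
                if p > f:
                    wins += 1.0
                elif p == f:
                    wins += 0.5
        result.append(wins / (len(passed) * len(failed)))
    return result


# Hypothetical toy data: rows are code candidates, columns are tests.
matrix = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
]
aucs = loo_auc(matrix)
print(aucs)
```

In this toy matrix the last test passes exactly the candidates that the other three tests rank lowest, so its LOO-AUC is 0.0, while the first two tests each score 0.75. A value near 1 indicates a held-out test that agrees with the ranking induced by the remaining tests, and a value near 0 indicates one that contradicts it.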
