ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
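The leave-one-out step described above can be illustrated with a minimal sketch. It assumes a binary pass matrix whose rows are code candidates and whose columns are tests, uses the plain pass count over the remaining tests as the aggregate score, counts ties as one half, and assigns 0.5 to a test that every candidate passes or every candidate fails. The function name `loo_auc` and these conventions are assumptions made for illustration; the abstract does not specify them, and this is not the ACES-C or ACES-O weighting itself.

```python
import numpy as np

def loo_auc(pass_matrix):
    """Leave-one-out AUC per test (illustrative sketch, not the paper's code).

    pass_matrix[i, j] is 1 if code candidate i passes test j, else 0.
    """
    P = np.asarray(pass_matrix, dtype=float)
    n_tests = P.shape[1]
    totals = P.sum(axis=1)          # passes per code over all tests
    aucs = np.full(n_tests, 0.5)    # assumption: degenerate tests score 0.5

    for j in range(n_tests):
        # Aggregate score of each code on all tests except the held-out one.
        scores = totals - P[:, j]
        passed = scores[P[:, j] == 1]
        failed = scores[P[:, j] == 0]
        if passed.size == 0 or failed.size == 0:
            continue  # no pass/fail split, so the AUC is undefined

        # AUC = P(score of a passing code > score of a failing code),
        # with ties counted as one half (assumed convention).
        diff = passed[:, None] - failed[None, :]
        aucs[j] = (np.sum(diff > 0) + 0.5 * np.sum(diff == 0)) / diff.size
    return aucs

# Toy matrix: tests 0-2 agree with each other, test 3 votes the opposite way.
P = [[1, 1, 1, 0],
     [1, 1, 1, 0],
     [0, 0, 0, 1],
     [0, 0, 0, 1]]
print(loo_auc(P))
```

On this toy matrix the three mutually consistent tests each receive a LOO-AUC of 1.0, while the contrarian test receives 0.0, which shows how a test can be scored by its agreement with the other tests without knowing which candidates are actually correct.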

