ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation

Abstract

Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter out unreliable tests. Yet determining whether a test is correct requires knowing which code candidates are correct, which creates a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many candidates pass a test, but whether the test can distinguish correct code from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank the candidates by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and both achieve state-of-the-art Pass@k on multiple code generation benchmarks.
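
The leave-one-out step described in the abstract can be illustrated with a minimal sketch. It assumes that the aggregate score is a plain unweighted count of passed tests, that ties count as half a win, and that a test passed by all or by none of the candidates is treated as uninformative (AUC 0.5). The function name `loo_auc` and the example matrix are hypothetical and not taken from the paper, and the sketch does not reproduce the ACES-C weights or the ACES-O objective.

```python
from typing import List

def loo_auc(passes: List[List[int]], held_out: int) -> float:
    """Leave-one-out AUC of one test; passes[i][j] == 1 if code i passes test j."""
    n_tests = len(passes[0])
    # Aggregate score of each code on every test except the held-out one
    # (assumption: an unweighted pass count).
    scores = [
        sum(row[j] for j in range(n_tests) if j != held_out)
        for row in passes
    ]
    # Split the aggregate scores by the held-out test's pass/fail outcome.
    passed = [s for s, row in zip(scores, passes) if row[held_out] == 1]
    failed = [s for s, row in zip(scores, passes) if row[held_out] == 0]
    if not passed or not failed:
        return 0.5  # assumption: AUC is undefined here, so treat as uninformative
    # AUC = fraction of (passing, failing) pairs in which the passing code
    # has the higher aggregate score; ties count as half.
    wins = sum(
        1.0 if p > f else 0.5 if p == f else 0.0
        for p in passed
        for f in failed
    )
    return wins / (len(passed) * len(failed))

# Hypothetical pass matrix: 5 code candidates (rows) x 4 tests (columns).
# Test 3 votes against the consensus of tests 0 to 2.
PASS_MATRIX = [
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 1, 1],
]

if __name__ == "__main__":
    for j in range(4):
        print(f"test {j}: LOO-AUC = {loo_auc(PASS_MATRIX, j):.3f}")
```

In this toy matrix, test 0 agrees with the ranking induced by the other tests (LOO-AUC 0.75), while test 3 contradicts it (LOO-AUC 0.0). No correctness labels for code or tests are used to compute either value.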
