ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
April 5, 2026
Authors: Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li
cs.AI
Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
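The leave-one-out procedure described above can be sketched from the binary pass matrix alone. The following is an illustrative reconstruction of the LOO-AUC idea, not the authors' implementation: for each held-out test, candidates are ranked by their (unweighted) pass counts on the remaining tests, and the AUC measures whether candidates that pass the held-out test outrank those that fail it.

```python
import numpy as np

def loo_auc(P):
    """Leave-one-out AUC consistency score for each test.

    P: binary pass matrix of shape (n_codes, n_tests), where
    P[i, j] = 1 iff code candidate i passes test j.

    Illustrative sketch: uses plain pass counts as the aggregate
    score; the paper's ACES variants weight tests instead.
    """
    n_codes, n_tests = P.shape
    scores = np.zeros(n_tests)
    for j in range(n_tests):
        # Rank candidates by total passes on all *other* tests.
        ranking = P.sum(axis=1) - P[:, j]
        passed = ranking[P[:, j] == 1]   # scores of codes passing test j
        failed = ranking[P[:, j] == 0]   # scores of codes failing test j
        if len(passed) == 0 or len(failed) == 0:
            scores[j] = 0.5              # test separates nothing: uninformative
            continue
        # AUC = P(passing code outranks failing code); ties count 0.5.
        wins = (passed[:, None] > failed[None, :]).sum()
        ties = (passed[:, None] == failed[None, :]).sum()
        scores[j] = (wins + 0.5 * ties) / (len(passed) * len(failed))
    return scores
```

A test whose pass/fail pattern agrees with the consensus ranking of the other tests gets an LOO-AUC near 1, while an adversarial test that passes mostly low-ranked candidates scores near 0, so these scores can serve as per-test weights when aggregating votes.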