ACES: Who Tests the Tests? Leave-One-Out AUC Consistency for Code Generation
April 5, 2026
Authors: Hui Sun, Yun-Ji Zhang, Zheng Xie, Ren-Biao Liu, Yali Du, Xin-Ye Li, Ming Li
cs.AI
Abstract
Selecting LLM-generated code candidates using LLM-generated tests is challenging because the tests themselves may be incorrect. Existing methods either treat all tests equally or rely on ad-hoc heuristics to filter unreliable tests. Yet determining test correctness requires knowing which codes are correct, creating a circular dependency. Our key insight is that we need not determine test correctness at all: test votes should rank, not merely count. What matters is not how many codes pass a test, but whether the test can distinguish correct from incorrect code. We break the circular dependency via leave-one-out evaluation: hold out one test, rank codes by their aggregate scores on all remaining tests, and measure whether the held-out test's pass/fail pattern agrees with this ranking. We formalize this agreement as the leave-one-out AUC (LOO-AUC) and prove that the expected LOO-AUC is proportional to each test's ability to separate correct code from incorrect code. Building on this, we propose ACES (AUC ConsistEncy Scoring) with two complementary variants: ACES-C provides closed-form weights that provably approximate the oracle in expectation under a mild assumption on average test quality; ACES-O drops this assumption and iteratively optimizes a differentiable LOO-AUC objective. Both operate solely on the binary pass matrix with negligible overhead, and achieve state-of-the-art Pass@k on multiple code generation benchmarks.
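The leave-one-out procedure described above can be sketched from the binary pass matrix alone. The following is an illustrative reconstruction of the LOO-AUC idea, not the authors' implementation: for each held-out test, candidates are ranked by their (unweighted) pass counts on the remaining tests, and the AUC measures whether candidates that pass the held-out test outrank those that fail it.

```python
import numpy as np

def loo_auc(P):
    """Leave-one-out AUC consistency score for each test.

    P: binary pass matrix of shape (n_codes, n_tests), where
    P[i, j] = 1 iff code candidate i passes test j.

    Illustrative sketch: uses plain pass counts as the aggregate
    score; the paper's ACES variants weight tests instead.
    """
    n_codes, n_tests = P.shape
    scores = np.zeros(n_tests)
    for j in range(n_tests):
        # Rank candidates by total passes on all *other* tests.
        ranking = P.sum(axis=1) - P[:, j]
        passed = ranking[P[:, j] == 1]   # scores of codes passing test j
        failed = ranking[P[:, j] == 0]   # scores of codes failing test j
        if len(passed) == 0 or len(failed) == 0:
            scores[j] = 0.5              # test separates nothing: uninformative
            continue
        # AUC = P(passing code outranks failing code); ties count 0.5.
        wins = (passed[:, None] > failed[None, :]).sum()
        ties = (passed[:, None] == failed[None, :]).sum()
        scores[j] = (wins + 0.5 * ties) / (len(passed) * len(failed))
    return scores
```

A test whose pass/fail pattern agrees with the consensus ranking of the other tests gets an LOO-AUC near 1, while an adversarial test that passes mostly low-ranked candidates scores near 0, so these scores can serve as per-test weights when aggregating votes.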