Rethinking Verification for LLM Code Generation: From Generation to Testing
July 9, 2025
Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have recently achieved notable success in
code-generation benchmarks such as HumanEval and LiveCodeBench. However, a
detailed examination reveals that these evaluation suites often comprise only a
limited number of homogeneous test cases, resulting in subtle faults going
undetected. This not only artificially inflates measured performance but also
compromises accurate reward estimation in reinforcement learning frameworks
utilizing verifiable rewards (RLVR). To address these critical shortcomings, we
systematically investigate the test-case generation (TCG) task by proposing
multi-dimensional metrics designed to rigorously quantify test-suite
thoroughness. Furthermore, we introduce SAGA, a human-LLM collaborative method
that combines human programming expertise with LLM reasoning capabilities to
significantly enhance both the coverage and the quality of generated test
cases. In addition, we develop TCGBench, a benchmark for studying the
TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a
verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc)
of the code generation evaluation benchmark synthesized by SAGA is 10.78%
higher than that of LiveCodeBench-v6. These results demonstrate the
effectiveness of our proposed method. We hope this work contributes to building
a scalable foundation for reliable LLM code evaluation, further advancing RLVR
in code generation, and paving the way for automated adversarial test synthesis
and adaptive benchmark integration.
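
To make the reported metrics concrete, the following is a minimal Python sketch of one plausible way to compute a detection rate and a verifier accuracy for a generated test suite. The data shapes, function names (passes, detection_rate, verifier_accuracy), and the exact metric definitions are illustrative assumptions for exposition, not the paper's implementation or TCGBench's API.

from typing import Callable, Dict, List, Tuple

# Hypothetical representation: a candidate solution maps an input string to an
# output string, and a test case pairs an input with its expected output.
Solution = Callable[[str], str]
TestCase = Dict[str, str]  # {"input": ..., "expected": ...}


def passes(solution: Solution, tests: List[TestCase]) -> bool:
    """True if the solution produces the expected output on every test case."""
    return all(solution(t["input"]) == t["expected"] for t in tests)


def detection_rate(tests: List[TestCase], faulty: List[Solution]) -> float:
    """Fraction of known-faulty solutions that the test suite rejects."""
    if not faulty:
        return 0.0
    return sum(not passes(s, tests) for s in faulty) / len(faulty)


def verifier_accuracy(tests: List[TestCase],
                      labeled: List[Tuple[Solution, bool]]) -> float:
    """Fraction of labeled candidate solutions whose pass/fail verdict from
    the test suite agrees with the ground-truth correctness label."""
    if not labeled:
        return 0.0
    return sum(passes(s, tests) == ok for s, ok in labeled) / len(labeled)

Under this reading, a higher detection rate means the suite exposes more subtly faulty programs, while verifier accuracy measures how reliably the suite's pass/fail signal can stand in for ground truth, e.g. as a reward signal in RLVR.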