Rethinking Verification for LLM Code Generation: From Generation to Testing
July 9, 2025
Authors: Zihan Ma, Taolin Zhang, Maosong Cao, Wenwei Zhang, Minnan Luo, Songyang Zhang, Kai Chen
cs.AI
Abstract
Large language models (LLMs) have recently achieved notable success in
code-generation benchmarks such as HumanEval and LiveCodeBench. However, a
detailed examination reveals that these evaluation suites often comprise only a
limited number of homogeneous test cases, resulting in subtle faults going
undetected. This not only artificially inflates measured performance but also
compromises accurate reward estimation in reinforcement learning frameworks
utilizing verifiable rewards (RLVR). To address these critical shortcomings, we
systematically investigate the test-case generation (TCG) task by proposing
multi-dimensional metrics designed to rigorously quantify test-suite
thoroughness. Furthermore, we introduce a human-LLM collaborative method
(SAGA) that combines human programming expertise with LLM reasoning capabilities,
aimed at significantly enhancing both the coverage and the quality of generated
test cases. In addition, we develop TCGBench to facilitate the study of the
TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a
verifier accuracy of 32.58% on TCGBench. The Verifier Accuracy (Verifier Acc)
of the code generation evaluation benchmark synthesized by SAGA is 10.78%
higher than that of LiveCodeBench-v6. These results demonstrate the
effectiveness of our proposed method. We hope this work contributes to building
a scalable foundation for reliable LLM code evaluation, further advancing RLVR
in code generation, and paving the way for automated adversarial test synthesis
and adaptive benchmark integration.
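
As a rough illustration of the two metrics quoted above, the sketch below shows one plausible way to compute a detection rate (the fraction of known-faulty solutions that fail at least one generated test) and a verifier accuracy (the fraction of candidate solutions whose pass/fail verdict from the test suite matches their ground-truth label). The abstract does not give the exact metric definitions used by SAGA and TCGBench, so the data layout and function names here are illustrative assumptions, not the paper's implementation.

```python
# Illustrative sketch only: the pass/fail semantics and the labeling of
# candidates as correct/faulty are assumptions made for this example.
from typing import Callable, List, Tuple

TestCase = Tuple[object, object]      # (input, expected_output)
Candidate = Tuple[Callable, bool]     # (solution_fn, ground-truth "is correct" label)

def passes_suite(solution: Callable, tests: List[TestCase]) -> bool:
    """A solution passes only if it matches the expected output on every test."""
    for x, expected in tests:
        try:
            if solution(x) != expected:
                return False
        except Exception:
            return False
    return True

def detection_rate(faulty: List[Callable], tests: List[TestCase]) -> float:
    """Fraction of known-faulty solutions that fail at least one test."""
    if not faulty:
        return 0.0
    caught = sum(1 for s in faulty if not passes_suite(s, tests))
    return caught / len(faulty)

def verifier_accuracy(candidates: List[Candidate], tests: List[TestCase]) -> float:
    """Fraction of candidates whose suite verdict matches their ground-truth label."""
    if not candidates:
        return 0.0
    agree = sum(1 for s, is_correct in candidates
                if passes_suite(s, tests) == is_correct)
    return agree / len(candidates)

if __name__ == "__main__":
    # Toy problem: return the maximum of a list.
    tests = [([1, 2, 3], 3), ([-5, -2, -9], -2), ([7], 7)]
    correct = lambda xs: max(xs)
    faulty = lambda xs: xs[0]   # subtle bug: only correct when the max comes first
    print(detection_rate([faulty], tests))                               # 1.0
    print(verifier_accuracy([(correct, True), (faulty, False)], tests))  # 1.0
```

A more thorough test suite raises both numbers by exposing subtle faults that a small, homogeneous suite (such as `[([7], 7)]` alone, which the buggy solution above would pass) lets slip through.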