Rethinking Verification for LLM Code Generation: From Generation to Testing

Abstract

Large language models (LLMs) have recently achieved notable success on code-generation benchmarks such as HumanEval and LiveCodeBench. However, a closer examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, so subtle faults go undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks that use verifiable rewards (RLVR). To address these shortcomings, we systematically investigate the test-case generation (TCG) task and propose multi-dimensional metrics designed to rigorously quantify test-suite thoroughness. Furthermore, we introduce SAGA, a human-LLM collaborative method that combines human programming expertise with LLM reasoning capability, aimed at significantly enhancing both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy (Verifier Acc) of 32.58% on TCGBench. The Verifier Acc of the code generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of our proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advancing RLVR in code generation, and paving the way for automated adversarial test synthesis and adaptive benchmark integration.
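The abstract reports two metrics without defining them, so the following is only a minimal sketch under stated assumptions. It assumes that detection rate is the fraction of known-incorrect solutions that fail at least one generated test, and that verifier accuracy is the fraction of problems whose test suite rejects every known-incorrect solution. Under those assumed definitions, a suite can catch most individual faulty solutions while still fully verifying far fewer problems, which would be consistent with the gap between the two reported numbers. All names and the toy data here are hypothetical and do not come from TCGBench.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: a solution maps an int to an int,
# and a test is an (input, expected_output) pair.
Solution = Callable[[int], int]
Test = Tuple[int, int]


def is_detected(solution: Solution, tests: List[Test]) -> bool:
    """Return True if at least one test exposes the solution as wrong."""
    for x, expected in tests:
        try:
            if solution(x) != expected:
                return True
        except Exception:
            # A crash on a test input also counts as a detected fault.
            return True
    return False


def detection_rate(wrong: Dict[str, List[Solution]],
                   tests: Dict[str, List[Test]]) -> float:
    """Assumed definition: share of incorrect solutions caught, over all problems."""
    total = sum(len(sols) for sols in wrong.values())
    caught = sum(is_detected(s, tests[p]) for p, sols in wrong.items() for s in sols)
    return caught / total


def verifier_accuracy(wrong: Dict[str, List[Solution]],
                      tests: Dict[str, List[Test]]) -> float:
    """Assumed definition: share of problems where every incorrect solution is caught."""
    fully_verified = sum(
        all(is_detected(s, tests[p]) for s in sols) for p, sols in wrong.items()
    )
    return fully_verified / len(wrong)


# Toy data (hypothetical): two problems, each with two known-incorrect solutions.
TESTS: Dict[str, List[Test]] = {
    "abs": [(3, 3), (-2, 2)],      # no test probes x == 0
    "double": [(1, 2), (0, 0)],
}
WRONG: Dict[str, List[Solution]] = {
    "abs": [
        lambda x: x,                          # wrong for negatives, caught by (-2, 2)
        lambda x: abs(x) if x != 0 else 1,    # wrong only at 0, slips through
    ],
    "double": [
        lambda x: x * x,                      # caught by (1, 2)
        lambda x: x + 1,                      # caught by (0, 0)
    ],
}

print(detection_rate(WRONG, TESTS))     # 0.75
print(verifier_accuracy(WRONG, TESTS))  # 0.5
```

In this toy case three of the four faulty solutions are caught, yet only one of the two problems is fully verified, because a single missed edge case is enough to fail the stricter per-problem check.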
