Rethinking Verification for LLM Code Generation: From Generation to Testing

Abstract

Large language models (LLMs) have recently achieved notable success on code-generation benchmarks such as HumanEval and LiveCodeBench. However, a closer examination reveals that these evaluation suites often comprise only a limited number of homogeneous test cases, so subtle faults go undetected. This not only artificially inflates measured performance but also compromises accurate reward estimation in reinforcement learning frameworks that use verifiable rewards (RLVR). To address these shortcomings, we systematically investigate the test-case generation (TCG) task and propose multi-dimensional metrics that rigorously quantify test-suite thoroughness. Furthermore, we introduce SAGA, a human-LLM collaborative method that combines human programming expertise with LLM reasoning capability to significantly improve both the coverage and the quality of generated test cases. In addition, we develop TCGBench to facilitate the study of the TCG task. Experiments show that SAGA achieves a detection rate of 90.62% and a verifier accuracy of 32.58% on TCGBench. The verifier accuracy (Verifier Acc) of the code-generation evaluation benchmark synthesized by SAGA is 10.78% higher than that of LiveCodeBench-v6. These results demonstrate the effectiveness of the proposed method. We hope this work contributes to building a scalable foundation for reliable LLM code evaluation, further advances RLVR in code generation, and paves the way for automated adversarial test synthesis and adaptive benchmark integration.
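The abstract reports a detection rate and a verifier accuracy but does not define them. The sketch below shows one plausible reading, stated purely as an assumption: detection rate is taken as the fraction of known-incorrect solutions that a test suite rejects, and verifier accuracy as the fraction of problems whose suite rejects every known-incorrect solution. The function names (`rejects`, `detection_rate`, `verifier_accuracy`) and the toy `abs` problem are hypothetical and are not taken from the paper.

```python
from typing import Callable, Dict, List, Tuple

# Assumed toy setting: a solution maps an int to an int,
# and a test case is an (input, expected_output) pair.
Solution = Callable[[int], int]
TestCase = Tuple[int, int]


def rejects(tests: List[TestCase], solution: Solution) -> bool:
    """Return True if at least one test exposes the solution as wrong."""
    for inp, expected in tests:
        try:
            if solution(inp) != expected:
                return True
        except Exception:
            # A crash on a test input also counts as a rejection.
            return True
    return False


def detection_rate(tests: List[TestCase], wrong: List[Solution]) -> float:
    """Assumed definition: share of known-wrong solutions the suite rejects."""
    if not wrong:
        return 0.0
    return sum(rejects(tests, s) for s in wrong) / len(wrong)


def verifier_accuracy(
    suites: Dict[str, List[TestCase]],
    wrong_by_problem: Dict[str, List[Solution]],
) -> float:
    """Assumed definition: share of problems whose suite rejects ALL wrong solutions."""
    if not suites:
        return 0.0
    solved = sum(
        all(rejects(tests, s) for s in wrong_by_problem.get(name, []))
        for name, tests in suites.items()
    )
    return solved / len(suites)


# Hypothetical problem: compute abs(x). Both candidate solutions are wrong.
wrong_abs: List[Solution] = [lambda x: x, lambda x: -x]

# A homogeneous suite (positive inputs only) misses the first bug.
weak_suite: List[TestCase] = [(1, 1), (2, 2)]
# Adding a negative input exposes both bugs.
strong_suite: List[TestCase] = [(1, 1), (-3, 3)]

weak_dr = detection_rate(weak_suite, wrong_abs)
strong_dr = detection_rate(strong_suite, wrong_abs)
weak_vacc = verifier_accuracy({"abs": weak_suite}, {"abs": wrong_abs})
strong_vacc = verifier_accuracy({"abs": strong_suite}, {"abs": wrong_abs})
print(weak_dr, strong_dr, weak_vacc, strong_vacc)
```

Under these assumed definitions, verifier accuracy is the stricter of the two: a suite that catches most wrong solutions for a problem still scores zero on that problem if a single one slips through, which is consistent with the gap between the two figures reported above.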
