Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

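Abstract

This paper introduces TestCase-Eval, a new benchmark for systematically evaluating large language models (LLMs) on test-case generation. TestCase-Eval comprises 500 algorithm problems and 100,000 human-written solutions collected from the Codeforces platform. The benchmark focuses on two key tasks: (1) Fault Coverage, which measures how well an LLM-generated test set probes diverse input scenarios and covers a wide range of potential failure modes; and (2) Fault Exposure, which evaluates whether an LLM can craft a tailored test input that exposes a specific incorrect code implementation. We comprehensively evaluate 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval and offer insights into their strengths and limitations in generating effective test cases for algorithm problems.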
