

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

June 13, 2025
Authors: Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao
cs.AI

Abstract

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
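To make the two tasks concrete, below is a minimal, hypothetical Python sketch of how such metrics could be scored. The `Runner` type, function names, and the pass/fail criterion (output disagreement with a reference solution) are illustrative assumptions for exposition, not the paper's actual evaluation harness.

```python
# Illustrative sketch (not the paper's released code): one plausible way to
# score the two TestCase-Eval tasks, assuming access to a reference solution,
# a set of known-buggy solutions, and a way to run a solution on an input.

from typing import Callable

Runner = Callable[[str], str]  # maps a test input string to the program's output


def fault_coverage(tests: list[str],
                   reference: Runner,
                   buggy_solutions: list[Runner]) -> float:
    """Fraction of buggy solutions that disagree with the reference
    on at least one generated test input (hypothetical metric form)."""
    exposed = 0
    for buggy in buggy_solutions:
        if any(buggy(t) != reference(t) for t in tests):
            exposed += 1
    return exposed / len(buggy_solutions) if buggy_solutions else 0.0


def fault_exposure(test: str, reference: Runner, buggy: Runner) -> bool:
    """True if a single targeted input distinguishes one specific
    buggy implementation from the reference solution."""
    return buggy(test) != reference(test)
```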