

Can LLMs Generate High-Quality Test Cases for Algorithm Problems? TestCase-Eval: A Systematic Evaluation of Fault Coverage and Exposure

June 13, 2025
Authors: Zheyuan Yang, Zexi Kuang, Xue Xia, Yilun Zhao
cs.AI

Abstract

We introduce TestCase-Eval, a new benchmark for systematic evaluation of LLMs in test-case generation. TestCase-Eval includes 500 algorithm problems and 100,000 human-crafted solutions from the Codeforces platform. It focuses on two pivotal tasks: (1) Fault Coverage, which measures how well LLM-generated test sets probe diverse input scenarios and cover a wide range of potential failure modes. (2) Fault Exposure, which evaluates whether LLMs can craft a tailored test input that reveals a specific incorrect code implementation. We provide a comprehensive assessment of 19 state-of-the-art open-source and proprietary LLMs on TestCase-Eval, offering insights into their strengths and limitations in generating effective test cases for algorithm problems.
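To make the two tasks concrete, below is a minimal, hypothetical Python sketch of how such metrics could be scored. The `Runner` type, function names, and the pass/fail criterion (output disagreement with a reference solution) are illustrative assumptions for exposition, not the paper's actual evaluation harness.

```python
# Illustrative sketch (not the paper's released code): one plausible way to
# score the two TestCase-Eval tasks, assuming access to a reference solution,
# a set of known-buggy solutions, and a way to run a solution on an input.

from typing import Callable

Runner = Callable[[str], str]  # maps a test input string to the program's output


def fault_coverage(tests: list[str],
                   reference: Runner,
                   buggy_solutions: list[Runner]) -> float:
    """Fraction of buggy solutions that disagree with the reference
    on at least one generated test input (hypothetical metric form)."""
    exposed = 0
    for buggy in buggy_solutions:
        if any(buggy(t) != reference(t) for t in tests):
            exposed += 1
    return exposed / len(buggy_solutions) if buggy_solutions else 0.0


def fault_exposure(test: str, reference: Runner, buggy: Runner) -> bool:
    """True if a single targeted input distinguishes one specific
    buggy implementation from the reference solution."""
    return buggy(test) != reference(test)
```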