CodeContests+: High-Quality Test Case Generation for Competitive Programming
June 6, 2025
Authors: Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen
cs.AI
Abstract
Competitive programming, due to its high reasoning difficulty and precise
correctness feedback, has become a key task for both training and evaluating
the reasoning capabilities of large language models (LLMs). However, while a
large amount of public problem data, such as problem statements and solutions,
is available, the test cases of these problems are often difficult to obtain.
Therefore, test case generation is a necessary task for building large-scale
datasets, and the quality of the test cases directly determines the accuracy of
the evaluation. In this paper, we introduce an LLM-based agent system that
creates high-quality test cases for competitive programming problems. We apply
this system to the CodeContests dataset and propose a new version with improved
test cases, named CodeContests+. We evaluated the quality of test cases in
CodeContestsPlus. First, we used 1.72 million submissions with pass/fail labels
to examine the accuracy of these test cases in evaluation. The results
indicated that CodeContests+ achieves significantly higher accuracy than
CodeContests, particularly with a notably higher True Positive Rate (TPR).
Subsequently, our experiments in LLM Reinforcement Learning (RL) further
confirmed that improvements in test case quality yield considerable advantages
for RL.
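As a rough illustration of the evaluation described above, the sketch below computes the True Positive Rate (TPR), True Negative Rate, and overall accuracy of a test-case set's verdicts against ground-truth pass/fail labels. All names and data are hypothetical; this is not the authors' actual pipeline, only a minimal sketch of the metric.

```python
# Hypothetical sketch: judging how accurately a generated test-case set
# classifies submissions, given ground-truth pass/fail labels (as in the
# paper's 1.72M-submission evaluation). Illustrative only.

def verdict_accuracy(labels, verdicts):
    """Compare ground-truth labels ("pass"/"fail") with the verdicts the
    generated test cases produced; return (TPR, TNR, accuracy).

    A "positive" is a genuinely correct submission, so TPR measures how
    often correct submissions are accepted by the generated tests.
    """
    tp = sum(1 for l, v in zip(labels, verdicts) if l == "pass" and v == "pass")
    fn = sum(1 for l, v in zip(labels, verdicts) if l == "pass" and v == "fail")
    tn = sum(1 for l, v in zip(labels, verdicts) if l == "fail" and v == "fail")
    fp = sum(1 for l, v in zip(labels, verdicts) if l == "fail" and v == "pass")
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(labels) if labels else 0.0
    return tpr, tnr, acc

# Toy data: overly strict tests wrongly reject one correct submission,
# which lowers TPR even though TNR stays perfect.
labels   = ["pass", "pass", "fail", "fail", "pass"]
verdicts = ["pass", "fail", "fail", "fail", "pass"]
tpr, tnr, acc = verdict_accuracy(labels, verdicts)
print(tpr, tnr, acc)  # 0.666..., 1.0, 0.8
```

A low TPR of this kind is exactly the failure mode the abstract highlights: incorrect or overly restrictive test cases reject correct solutions, which distorts both evaluation and the reward signal used in RL.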