CodeContests+: High-Quality Test Case Generation for Competitive Programming
June 6, 2025
Authors: Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen
cs.AI
Abstract
Competitive programming, due to its high reasoning difficulty and precise
correctness feedback, has become a key task for both training and evaluating
the reasoning capabilities of large language models (LLMs). However, while a
large amount of public problem data, such as problem statements and solutions,
is available, the test cases of these problems are often difficult to obtain.
Therefore, test case generation is a necessary task for building large-scale
datasets, and the quality of the test cases directly determines the accuracy of
the evaluation. In this paper, we introduce an LLM-based agent system that
creates high-quality test cases for competitive programming problems. We apply
this system to the CodeContests dataset and propose a new version with improved
test cases, named CodeContests+. We evaluated the quality of test cases in
CodeContestsPlus. First, we used 1.72 million submissions with pass/fail labels
to examine the accuracy of these test cases in evaluation. The results
indicated that CodeContests+ achieves significantly higher accuracy than
CodeContests, particularly with a notably higher True Positive Rate (TPR).
Subsequently, our experiments in LLM Reinforcement Learning (RL) further
confirmed that improvements in test case quality yield considerable advantages
for RL.
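As a rough illustration of the evaluation described above, the sketch below computes the True Positive Rate (TPR), True Negative Rate, and overall accuracy of a test-case set's verdicts against ground-truth pass/fail labels. All names and data are hypothetical; this is not the authors' actual pipeline, only a minimal sketch of the metric.

```python
# Hypothetical sketch: judging how accurately a generated test-case set
# classifies submissions, given ground-truth pass/fail labels (as in the
# paper's 1.72M-submission evaluation). Illustrative only.

def verdict_accuracy(labels, verdicts):
    """Compare ground-truth labels ("pass"/"fail") with the verdicts the
    generated test cases produced; return (TPR, TNR, accuracy).

    A "positive" is a genuinely correct submission, so TPR measures how
    often correct submissions are accepted by the generated tests.
    """
    tp = sum(1 for l, v in zip(labels, verdicts) if l == "pass" and v == "pass")
    fn = sum(1 for l, v in zip(labels, verdicts) if l == "pass" and v == "fail")
    tn = sum(1 for l, v in zip(labels, verdicts) if l == "fail" and v == "fail")
    fp = sum(1 for l, v in zip(labels, verdicts) if l == "fail" and v == "pass")
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    acc = (tp + tn) / len(labels) if labels else 0.0
    return tpr, tnr, acc

# Toy data: overly strict tests wrongly reject one correct submission,
# which lowers TPR even though TNR stays perfect.
labels   = ["pass", "pass", "fail", "fail", "pass"]
verdicts = ["pass", "fail", "fail", "fail", "pass"]
tpr, tnr, acc = verdict_accuracy(labels, verdicts)
print(tpr, tnr, acc)  # 0.666..., 1.0, 0.8
```

A low TPR of this kind is exactly the failure mode the abstract highlights: incorrect or overly restrictive test cases reject correct solutions, which distorts both evaluation and the reward signal used in RL.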