

CodeContests+: High-Quality Test Case Generation for Competitive Programming

June 6, 2025
Authors: Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen
cs.AI

Abstract
Competitive programming, due to its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases for these problems are often difficult to obtain. Test case generation is therefore a necessary step in building large-scale datasets, and the quality of the test cases directly determines the accuracy of evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. We evaluate the quality of the test cases in CodeContests+ in two ways. First, we use 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, in particular a notably higher true positive rate (TPR). Second, our experiments on LLM reinforcement learning (RL) further confirm that improvements in test case quality yield considerable benefits for RL.
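The evaluation described above scores a generated test suite by how well its verdicts agree with ground-truth pass/fail labels on real submissions, with the TPR measuring how often genuinely correct solutions are accepted. A minimal sketch of that bookkeeping is below; the function name, record layout, and example numbers are illustrative assumptions, not the paper's actual pipeline:

```python
def evaluate_test_suite(records):
    """Compare test-suite verdicts against ground-truth labels.

    records: list of (ground_truth, verdict) booleans, where
    ground_truth is True if the submission is actually correct and
    verdict is True if the generated test cases accept it.
    Returns TPR, FPR, and overall accuracy.
    """
    tp = fp = tn = fn = 0
    for truth, verdict in records:
        if truth and verdict:
            tp += 1          # correct solution accepted
        elif truth:
            fn += 1          # correct solution wrongly rejected
        elif verdict:
            fp += 1          # wrong solution wrongly accepted
        else:
            tn += 1          # wrong solution rejected
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    accuracy = (tp + tn) / len(records)
    return {"tpr": tpr, "fpr": fpr, "accuracy": accuracy}

# Hypothetical example: 3 correct submissions (2 accepted, 1 rejected),
# 2 incorrect submissions (1 rejected, 1 wrongly accepted).
records = [(True, True), (True, True), (True, False),
           (False, False), (False, True)]
print(evaluate_test_suite(records))
```

A low TPR here corresponds to overly strict or malformed test cases rejecting correct code, which is the failure mode CodeContests+ is reported to reduce relative to CodeContests.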

