CodeContests+: High-Quality Test Case Generation for Competitive Programming
June 6, 2025
Authors: Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen
cs.AI
Abstract
Competitive programming, due to its high reasoning difficulty and precise
correctness feedback, has become a key task for both training and evaluating
the reasoning capabilities of large language models (LLMs). However, while a
large amount of public problem data, such as problem statements and solutions,
is available, the test cases of these problems are often difficult to obtain.
Therefore, test case generation is a necessary task for building large-scale
datasets, and the quality of the test cases directly determines the accuracy of
the evaluation. In this paper, we introduce an LLM-based agent system that
creates high-quality test cases for competitive programming problems. We apply
this system to the CodeContests dataset and propose a new version with improved
test cases, named CodeContests+. We evaluate the quality of the test cases in
CodeContests+ in two ways. First, we use 1.72 million submissions with
pass/fail labels to examine the accuracy of these test cases in evaluation. The
results indicate that CodeContests+ achieves significantly higher accuracy than
CodeContests, particularly with a notably higher True Positive Rate (TPR).
Subsequently, our experiments in LLM Reinforcement Learning (RL) further
confirmed that improvements in test case quality yield considerable advantages
for RL.
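The evaluation described above treats the generated test suite as a classifier of submissions: a correct solution should pass, and an incorrect one should fail. A minimal sketch of that measurement, assuming ground-truth pass/fail labels and the verdicts obtained by running each submission against the generated tests (all function and variable names here are illustrative, not from the paper):

```python
# Hypothetical sketch: measuring test-case quality via TPR/TNR,
# given ground-truth pass/fail labels for submissions and the
# verdicts produced by judging them against generated test cases.
# True = "submission is correct" / "submission passes the tests".

def confusion_counts(ground_truth, verdicts):
    """Count TP/FP/TN/FN over paired labels and judge verdicts."""
    tp = fp = tn = fn = 0
    for truth, verdict in zip(ground_truth, verdicts):
        if truth and verdict:
            tp += 1
        elif truth and not verdict:
            fn += 1  # correct solution wrongly rejected (e.g., invalid test input)
        elif not truth and verdict:
            fp += 1  # wrong solution wrongly accepted (tests too weak)
        else:
            tn += 1
    return tp, fp, tn, fn

def tpr_tnr(ground_truth, verdicts):
    """TPR: fraction of correct solutions accepted; TNR: fraction of wrong ones rejected."""
    tp, fp, tn, fn = confusion_counts(ground_truth, verdicts)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr

# Toy example: 4 submissions with known labels vs. test-suite verdicts.
truth    = [True, True, False, False]
verdicts = [True, False, False, False]
print(tpr_tnr(truth, verdicts))  # → (0.5, 1.0)
```

A low TPR indicates the tests themselves are faulty (they reject known-correct code), while a low TNR indicates the tests are too weak to expose wrong solutions; the abstract's headline result is the TPR improvement of CodeContests+ over CodeContests.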