CodeContests+: High-Quality Test Case Generation for Competitive Programming
June 6, 2025
Authors: Zihan Wang, Siyao Liu, Yang Sun, Hongyan Li, Kai Shen
cs.AI
Abstract
Competitive programming, due to its high reasoning difficulty and precise
correctness feedback, has become a key task for both training and evaluating
the reasoning capabilities of large language models (LLMs). However, while a
large amount of public problem data, such as problem statements and solutions,
is available, the test cases of these problems are often difficult to obtain.
Therefore, test case generation is a necessary task for building large-scale
datasets, and the quality of the test cases directly determines the accuracy of
the evaluation. In this paper, we introduce an LLM-based agent system that
creates high-quality test cases for competitive programming problems. We apply
this system to the CodeContests dataset and propose a new version with improved
test cases, named CodeContests+. We evaluate the quality of the test cases in
CodeContests+ in two ways. First, we use 1.72 million submissions with
pass/fail labels to examine the accuracy of these test cases in evaluation. The
results indicate that CodeContests+ achieves significantly higher accuracy than
CodeContests, particularly with a notably higher True Positive Rate (TPR).
Subsequently, our experiments in LLM Reinforcement Learning (RL) further
confirmed that improvements in test case quality yield considerable advantages
for RL.
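The evaluation described above treats the generated test suite as a classifier of submissions: a correct solution should pass, and an incorrect one should fail. A minimal sketch of that measurement, assuming ground-truth pass/fail labels and the verdicts obtained by running each submission against the generated tests (all function and variable names here are illustrative, not from the paper):

```python
# Hypothetical sketch: measuring test-case quality via TPR/TNR,
# given ground-truth pass/fail labels for submissions and the
# verdicts produced by judging them against generated test cases.
# True = "submission is correct" / "submission passes the tests".

def confusion_counts(ground_truth, verdicts):
    """Count TP/FP/TN/FN over paired labels and judge verdicts."""
    tp = fp = tn = fn = 0
    for truth, verdict in zip(ground_truth, verdicts):
        if truth and verdict:
            tp += 1
        elif truth and not verdict:
            fn += 1  # correct solution wrongly rejected (e.g., invalid test input)
        elif not truth and verdict:
            fp += 1  # wrong solution wrongly accepted (tests too weak)
        else:
            tn += 1
    return tp, fp, tn, fn

def tpr_tnr(ground_truth, verdicts):
    """TPR: fraction of correct solutions accepted; TNR: fraction of wrong ones rejected."""
    tp, fp, tn, fn = confusion_counts(ground_truth, verdicts)
    tpr = tp / (tp + fn) if tp + fn else 0.0
    tnr = tn / (tn + fp) if tn + fp else 0.0
    return tpr, tnr

# Toy example: 4 submissions with known labels vs. test-suite verdicts.
truth    = [True, True, False, False]
verdicts = [True, False, False, False]
print(tpr_tnr(truth, verdicts))  # → (0.5, 1.0)
```

A low TPR indicates the tests themselves are faulty (they reject known-correct code), while a low TNR indicates the tests are too weak to expose wrong solutions; the abstract's headline result is the TPR improvement of CodeContests+ over CodeContests.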