CodeContests+: High-Quality Test Case Generation for Competitive Programming

Abstract

Competitive programming, with its high reasoning difficulty and precise correctness feedback, has become a key task for both training and evaluating the reasoning capabilities of large language models (LLMs). However, while a large amount of public problem data, such as problem statements and solutions, is available, the test cases for these problems are often difficult to obtain. Test case generation is therefore a necessary step in building large-scale datasets, and the quality of the test cases directly determines the accuracy of the evaluation. In this paper, we introduce an LLM-based agent system that creates high-quality test cases for competitive programming problems. We apply this system to the CodeContests dataset and propose a new version with improved test cases, named CodeContests+. To evaluate the quality of the test cases in CodeContests+, we first use 1.72 million submissions with pass/fail labels to examine the accuracy of these test cases in evaluation. The results indicate that CodeContests+ achieves significantly higher accuracy than CodeContests, with a notably higher True Positive Rate (TPR) in particular. Our experiments in LLM reinforcement learning (RL) then further confirm that improvements in test case quality yield considerable advantages for RL.

