HardTests: Synthesizing High-Quality Test Cases for LLM Coding

Abstract

Verifiers play a crucial role in large language model (LLM) reasoning and are required by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution can often be detected only by carefully human-written edge cases, and such cases are difficult to synthesize. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests achieve precision that is 11.3 percentage points higher and recall that is 17.5 percentage points higher when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
