HardTests: Synthesizing High-Quality Test Cases for LLM Coding
May 30, 2025
Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
cs.AI
Abstract
Verifiers play a crucial role in large language model (LLM) reasoning and are
required by post-training techniques such as reinforcement learning. However,
reliable verifiers are hard to obtain for difficult coding problems, because a
well-disguised wrong solution may only be detected by carefully human-written
edge cases that are difficult to synthesize automatically. To address this
issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using
LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive
programming dataset with 47k problems and synthesized high-quality tests.
Compared with existing tests, HARDTESTGEN tests demonstrate precision that is
11.3 percentage points higher and recall that is 17.5 percentage points higher
when evaluating LLM-generated code. For harder problems, the improvement in
precision can be as large as 40 percentage points. HARDTESTS also proves to be
more effective for model training, as measured by downstream code generation
performance. We will open-source our dataset and synthesis pipeline at
https://leililab.github.io/HardTests/.
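
For context, a test-case verifier of the kind described above typically accepts a candidate program only if its output matches the expected output on every test, and the precision/recall figures compare these verdicts against ground-truth correctness of the code. The sketch below illustrates that general setup only; it is not the authors' HARDTESTGEN implementation, and the helper names, the (input, expected-output) test format, and the assumption that candidates are Python scripts are all hypothetical.

```python
import subprocess

def run_candidate(source_path: str, stdin_text: str, timeout_s: float = 2.0) -> str:
    """Run a candidate solution (assumed here to be a Python script) on one test input."""
    result = subprocess.run(
        ["python", source_path],
        input=stdin_text,
        capture_output=True,
        text=True,
        timeout=timeout_s,
    )
    return result.stdout.strip()

def verify(source_path: str, tests: list[tuple[str, str]]) -> bool:
    """Accept a solution only if it matches the expected output on every test case."""
    try:
        return all(
            run_candidate(source_path, inp) == expected.strip()
            for inp, expected in tests
        )
    except Exception:  # a crash or timeout on any test counts as rejection
        return False

def precision_recall(verdicts: list[bool], ground_truth: list[bool]) -> tuple[float, float]:
    """Precision/recall of the verifier's 'accept' decisions against ground-truth correctness."""
    tp = sum(v and g for v, g in zip(verdicts, ground_truth))
    fp = sum(v and not g for v, g in zip(verdicts, ground_truth))
    fn = sum((not v) and g for v, g in zip(verdicts, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

Under this framing, weak test suites inflate false accepts (hurting precision), while overly strict or malformed tests reject correct code (hurting recall), which is why higher-quality synthesized tests improve both numbers.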