HardTests: Synthesizing High-Quality Test Cases for LLM Coding

May 30, 2025
Authors: Zhongmou He, Yee Man Choi, Kexun Zhang, Jiabao Ji, Junting Zhou, Dejia Xu, Ivan Bercovich, Aidan Zhang, Lei Li
cs.AI

Abstract

Verifiers play a crucial role in large language model (LLM) reasoning and are required by post-training techniques such as reinforcement learning. However, reliable verifiers are hard to obtain for difficult coding problems, because a well-disguised wrong solution may only be detected by carefully crafted, human-written edge cases that are difficult to synthesize automatically. To address this issue, we propose HARDTESTGEN, a pipeline for high-quality test synthesis using LLMs. With this pipeline, we curate HARDTESTS, a comprehensive competitive programming dataset with 47k problems and synthetic high-quality tests. Compared with existing tests, HARDTESTGEN tests are 11.3 percentage points higher in precision and 17.5 percentage points higher in recall when evaluating LLM-generated code. For harder problems, the improvement in precision can be as large as 40 points. HARDTESTS also proves more effective for model training, as measured by downstream code generation performance. We will open-source our dataset and synthesis pipeline at https://leililab.github.io/HardTests/.
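
To make the evaluation concrete, below is a minimal sketch (not the authors' pipeline) of how a test-based verifier and the abstract's precision/recall metrics can be computed. The names `TestCase`, `run_solution`, and `verify` are illustrative, and the sketch assumes candidate solutions are Python programs that read from stdin and write to stdout.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class TestCase:
    input_data: str
    expected_output: str

def run_solution(source_path: str, test: TestCase, timeout: float = 2.0) -> bool:
    """Run a candidate solution on one test case and compare its stdout."""
    try:
        result = subprocess.run(
            ["python", source_path],
            input=test.input_data,
            capture_output=True,
            text=True,
            timeout=timeout,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.stdout.strip() == test.expected_output.strip()

def verify(source_path: str, tests: list[TestCase]) -> bool:
    """A solution is accepted only if it passes every test case."""
    return all(run_solution(source_path, t) for t in tests)

def precision_recall(verdicts: list[bool], ground_truth: list[bool]) -> tuple[float, float]:
    """
    verdicts: the test suite's accept/reject decision per solution (True = accept).
    ground_truth: whether each solution is actually correct.
    Precision: of the accepted solutions, the fraction that are truly correct.
    Recall: of the truly correct solutions, the fraction that are accepted.
    """
    tp = sum(v and g for v, g in zip(verdicts, ground_truth))
    fp = sum(v and not g for v, g in zip(verdicts, ground_truth))
    fn = sum((not v) and g for v, g in zip(verdicts, ground_truth))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```

In a reinforcement learning setting, low precision means well-disguised wrong solutions slip through as positive reward signals, which is why the precision gains on harder problems matter for the post-training use case the abstract describes.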
