Use Property-Based Testing to Bridge LLM Code Generation and Validation

Abstract

Large Language Models (LLMs) excel at code generation, but ensuring that their outputs are functionally correct, especially in complex programming tasks, remains a persistent challenge. While traditional Test-Driven Development (TDD) offers a path for code refinement, its efficacy with LLMs is often undermined by the scarcity of high-quality test cases or by the pitfalls of automated test generation, including biased tests and inaccurate output predictions that can misdirect the correction process. This paper introduces Property-Generated Solver, a novel framework that leverages Property-Based Testing (PBT) to validate high-level program properties or invariants instead of relying on specific input-output examples. These properties are often simpler to define and verify than exhaustive test oracles are to predict directly, which breaks the "cycle of self-deception" in which tests might share flaws with the code they are meant to validate. Property-Generated Solver employs two collaborative LLM-based agents: a Generator dedicated to code generation and iterative refinement, and a Tester that manages the PBT lifecycle and formulates semantically rich feedback from property violations. The resulting comprehensive and actionable feedback then guides the Generator in its refinement efforts. By establishing PBT as the core validation engine within this iterative, closed-loop paradigm, Property-Generated Solver provides a robust mechanism for steering LLMs towards more correct and generalizable code. Extensive experimental results on multiple code generation benchmarks demonstrate that Property-Generated Solver achieves substantial pass@1 improvements, with relative gains ranging from 23.1% to 37.3% over established TDD methods.
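
To make the contrast between properties and input-output examples concrete, the following is a minimal sketch of property checking with violation feedback. It is not the paper's implementation: it uses a hand-rolled random-input loop from the Python standard library rather than a real PBT library, and every name in it (`buggy_sort`, `fixed_sort`, `PROPERTIES`, `run_pbt`) is hypothetical. The task (sorting a list) and the feedback format are assumptions chosen for illustration only.

```python
import random
from collections import Counter


def buggy_sort(xs):
    # Hypothetical flawed candidate, standing in for Generator output:
    # converting to a set silently drops duplicate elements.
    return sorted(set(xs))


def fixed_sort(xs):
    # Hypothetical refined candidate after feedback.
    return sorted(xs)


# Properties describe what any correct output must satisfy,
# without predicting the exact expected output for a given input.
def prop_ordered(xs, out):
    return all(a <= b for a, b in zip(out, out[1:]))


def prop_permutation(xs, out):
    return Counter(xs) == Counter(out)


PROPERTIES = {
    "output is non-decreasing": prop_ordered,
    "output is a permutation of the input": prop_permutation,
}


def run_pbt(candidate, trials=200, seed=0):
    """Check each property on random inputs.

    Returns a feedback string describing the first violation found,
    or None if no property was violated in the given number of trials.
    """
    rng = random.Random(seed)
    for _ in range(trials):
        xs = [rng.randint(-5, 5) for _ in range(rng.randint(0, 8))]
        out = candidate(list(xs))
        for name, prop in PROPERTIES.items():
            if not prop(xs, out):
                return f"Property violated: {name}; input={xs!r}, output={out!r}"
    return None


if __name__ == "__main__":
    print(run_pbt(buggy_sort))  # a violation message with a counterexample
    print(run_pbt(fixed_sort))  # None when no violation is found
```

In a closed loop of the kind the abstract describes, a message like the one returned by `run_pbt` would play the role of the Tester's feedback to the Generator. Note that neither property requires knowing the correct sorted output in advance, which is the sense in which properties can be easier to state than test oracles.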
