B4: Towards Optimal Assessment of Plausible Code Solutions with Plausible Tests

Abstract

Selecting the best code solution from multiple generated ones is an essential task in code generation, which can be achieved with the help of reliable validators (e.g., developer-written test cases). Since reliable test cases are not always available and can be expensive to build in practice, researchers have proposed automatically generating test cases to assess code solutions. However, when both code solutions and test cases are plausible but not reliable, selecting the best solution becomes challenging. Although some heuristic strategies have been proposed to tackle this problem, they lack a strong theoretical guarantee, and whether an optimal selection strategy exists is still an open question. Our work contributes in two ways. First, we show that within a Bayesian framework, the optimal selection strategy can be defined based on the posterior probability of the observed passing states between solutions and tests. The problem of identifying the best solution is then framed as an integer programming problem. Second, we propose an efficient approach for approximating this optimal (yet uncomputable) strategy, where the approximation error is bounded by the correctness of prior knowledge. We then incorporate effective prior knowledge to tailor the strategy to code generation tasks. Both theoretical and empirical studies confirm that existing heuristics are limited in selecting the best solutions with plausible test cases. Our proposed approximated optimal strategy, B4, significantly surpasses existing heuristics in selecting code solutions generated by large language models (LLMs) with LLM-generated tests, achieving a relative performance improvement of up to 50% over the strongest heuristic and 246% over random selection in the most challenging scenarios. Our code is publicly available at https://github.com/ZJU-CTAG/B4.
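To make the setting concrete, the following is a minimal sketch of the "passing states" between candidate solutions and plausible tests, together with one naive heuristic (pick the solution that passes the most tests). It is not the B4 strategy and is not taken from the authors' repository; the function names `pass_matrix` and `select_max_pass`, the toy task (absolute value), and the candidate solutions and tests are all assumptions made for illustration.

```python
from typing import Callable, List

Solution = Callable[[int], int]
Test = Callable[[Solution], bool]


def pass_matrix(solutions: List[Solution], tests: List[Test]) -> List[List[bool]]:
    """Return a matrix where entry [i][j] is True when solution i passes test j."""
    matrix = []
    for solve in solutions:
        row = []
        for check in tests:
            try:
                row.append(bool(check(solve)))
            except Exception:
                # A crashing solution counts as failing that test.
                row.append(False)
        matrix.append(row)
    return matrix


def select_max_pass(matrix: List[List[bool]]) -> int:
    """Naive heuristic (illustrative only): index of the solution passing the most tests.

    Ties are broken by the lowest index, which is arbitrary.
    """
    counts = [sum(row) for row in matrix]
    return max(range(len(counts)), key=lambda i: counts[i])


# Hypothetical toy task: implement absolute value.
solutions = [
    lambda x: abs(x),  # correct
    lambda x: x,       # wrong for negative inputs
    lambda x: -x,      # wrong for positive inputs
]

# Plausible tests: the third one is itself incorrect.
tests = [
    lambda f: f(3) == 3,
    lambda f: f(-2) == 2,
    lambda f: f(-1) == -1,  # unreliable test: expects the wrong answer
    lambda f: f(0) == 0,
]

matrix = pass_matrix(solutions, tests)
pass_counts = [sum(row) for row in matrix]
best = select_max_pass(matrix)
```

In this toy run the correct solution and the identity function each pass three of the four tests, so the heuristic separates them only through its arbitrary tie-break. This is the kind of situation the abstract describes: when tests are plausible but not reliable, counting passes alone does not identify the correct solution.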
