Guiding Language Models of Code with Global Context using Monitors
June 19, 2023
Authors: Lakshya A Agrawal, Aditya Kanade, Navin Goyal, Shuvendu K. Lahiri, Sriram K. Rajamani
cs.AI
Abstract
Language models of code (LMs) work well when the surrounding code in the
vicinity of generation provides sufficient context. This is not true when it
becomes necessary to use types or functionality defined in another module or
library, especially those not seen during training. LMs suffer from limited
awareness of such global context and end up hallucinating, e.g., using types
defined in other files incorrectly. Recent work tries to overcome this issue by
retrieving global information to augment the local context. However, this
bloats the prompt or requires architecture modifications and additional
training.
Integrated development environments (IDEs) assist developers by bringing the
global context at their fingertips using static analysis. We extend this
assistance, enjoyed by developers, to the LMs. We propose a notion of monitors
that use static analysis in the background to guide the decoding. Unlike a
priori retrieval, static analysis is invoked iteratively during the entire
decoding process, providing the most relevant suggestions on demand. We
demonstrate the usefulness of our proposal by monitoring for type-consistent
use of identifiers whenever an LM generates code for object dereference.
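The monitoring idea above can be illustrated with a minimal sketch (hypothetical names and a toy scoring interface, not the paper's implementation): when the partially generated code ends at an object dereference, a static-analysis callback supplies the type-valid member names, and the LM's scores for all other identifiers are masked before the next token is chosen.

```python
def monitor_guided_step(lm_scores, partial_code, valid_members_of):
    """Pick the next identifier, masking type-inconsistent choices.

    lm_scores: dict mapping candidate identifiers to LM scores.
    partial_code: the code generated so far.
    valid_members_of: static-analysis callback, receiver name -> set of members.
    """
    if partial_code.rstrip().endswith("."):
        # Crude receiver extraction for illustration only.
        receiver = partial_code.rstrip()[:-1].split()[-1]
        allowed = valid_members_of(receiver)
        # Keep only identifiers the static analysis deems type-consistent.
        masked = {tok: s for tok, s in lm_scores.items() if tok in allowed}
        if masked:  # fall back to unconstrained scores if analysis returns nothing
            lm_scores = masked
    return max(lm_scores, key=lm_scores.get)


# Toy example: the LM prefers a hallucinated member ("getFullName"),
# but the monitor restricts the choice to members that the (pretend)
# static analysis says are defined on the receiver's type.
scores = {"getName": 0.2, "getFullName": 0.7, "size": 0.1}
members = lambda recv: {"getName", "size"}
print(monitor_guided_step(scores, "return user.", members))  # -> getName
```

In the actual system, the analysis runs in the background and is re-invoked at each qualifying decoding step, so the set of allowed identifiers always reflects the current partial program rather than a one-time retrieval.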
To evaluate our approach, we curate PragmaticCode, a dataset of open-source
projects with their development environments. On models of varying parameter
scale, we show that monitor-guided decoding consistently improves the ability
of an LM to generate identifiers that match the ground truth, and also
improves compilation rates and agreement with ground truth. We find that LMs
with fewer parameters, when guided with our monitor, can outperform larger LMs.
With monitor-guided decoding, SantaCoder-1.1B achieves better compilation rate
and next-identifier match than the much larger text-davinci-003 model. The
datasets and code will be released at https://aka.ms/monitors4codegen.