

MPIrigen: MPI Code Generation through Domain-Specific Language Models

February 14, 2024
Authors: Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Abdul Wasay, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren
cs.AI

Abstract
The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (a specialized multi-lingual code model) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which are pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing step that performs completion only after the whole code has been observed, enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen excels at generating accurate MPI functions, reaching up to 0.8 accuracy for location and function predictions and more than 0.9 accuracy for argument predictions. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen

