MPIrigen: MPI Code Generation through Domain-Specific Language Models
February 14, 2024
Authors: Nadav Schneider, Niranjan Hasabnis, Vy A. Vo, Tal Kadosh, Neva Krien, Mihai Capotă, Abdul Wasay, Guy Tamir, Ted Willke, Nesreen Ahmed, Yuval Pinter, Timothy Mattson, Gal Oren
cs.AI
Abstract
The imperative need to scale computation across numerous nodes highlights the
significance of efficient parallel computing, particularly in the realm of
Message Passing Interface (MPI) integration. The challenging parallel
programming task of generating MPI-based parallel programs has remained
unexplored. This study first investigates the performance of state-of-the-art
language models in generating MPI-based parallel programs. Findings reveal that
widely used models such as GPT-3.5 and PolyCoder (specialized multi-lingual
code models) exhibit notable performance degradation when generating MPI-based
programs compared to general-purpose programs. In contrast, domain-specific
models such as MonoCoder, which are pretrained on the MPI-related programming
languages C and C++, outperform larger models. Subsequently, we introduce a
dedicated downstream task of MPI-based program generation by fine-tuning
MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an
innovative preprocessing step in which completion is performed only after the
whole code has been observed, enabling better completion with a wider context.
Comparative analysis
against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation
method, demonstrates that MPIrigen excels in generating accurate MPI functions,
achieving up to 0.8 accuracy in location and function predictions and more than
0.9 accuracy in argument predictions. The success of this tailored solution
underscores the importance of domain-specific fine-tuning in optimizing
language models for parallel computing code generation, paving the way for a
new generation of automatic parallelization tools. The sources of this work are
available at our GitHub MPIrigen repository:
https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen
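The HPC-oriented evaluation described above scores generated MPI calls along three axes: where the call is placed, which MPI function is used, and which arguments it receives. The sketch below is an illustrative assumption of how such per-axis accuracy could be computed, not the paper's actual implementation; the function name, call representation, and toy data are all hypothetical.

```python
# Illustrative sketch (not the paper's implementation): score predicted
# MPI calls against reference calls on three axes -- insertion location
# (line index), MPI function name, and argument list.

def score_mpi_predictions(predicted, reference):
    """Each call is a tuple: (line_index, function_name, args_tuple)."""
    loc_hits = fn_hits = arg_hits = 0
    for pred, ref in zip(predicted, reference):
        if pred[0] == ref[0]:   # call inserted at the correct line
            loc_hits += 1
        if pred[1] == ref[1]:   # correct MPI function chosen
            fn_hits += 1
        if pred[2] == ref[2]:   # arguments match exactly
            arg_hits += 1
    n = max(len(reference), 1)
    return {"location": loc_hits / n,
            "function": fn_hits / n,
            "arguments": arg_hits / n}

# Toy example: one call placed one line off, everything else correct.
ref = [(3, "MPI_Init", ("&argc", "&argv")),
       (10, "MPI_Comm_rank", ("MPI_COMM_WORLD", "&rank"))]
pred = [(3, "MPI_Init", ("&argc", "&argv")),
        (11, "MPI_Comm_rank", ("MPI_COMM_WORLD", "&rank"))]
scores = score_mpi_predictions(pred, ref)
```

On this toy data the sketch yields location accuracy 0.5 and function/argument accuracy 1.0, mirroring how the three axes can diverge.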