MPIrigen: MPI Code Generation through Domain-Specific Language Models

Abstract

The imperative need to scale computation across numerous nodes highlights the significance of efficient parallel computing, particularly in the realm of Message Passing Interface (MPI) integration. The challenging parallel programming task of generating MPI-based parallel programs has remained unexplored. This study first investigates the performance of state-of-the-art language models in generating MPI-based parallel programs. Findings reveal that widely used models such as GPT-3.5 and PolyCoder (a specialized multilingual code model) exhibit notable performance degradation when generating MPI-based programs compared to general-purpose programs. In contrast, domain-specific models such as MonoCoder, which is pretrained on the MPI-related programming languages C and C++, outperform larger models. Subsequently, we introduce a dedicated downstream task of MPI-based program generation by fine-tuning MonoCoder on HPCorpusMPI. We call the resulting model MPIrigen. We propose an innovative preprocessing step in which completion takes place only after the whole code has been observed, thus enabling better completion with a wider context. Comparative analysis against GPT-3.5 zero-shot performance, using a novel HPC-oriented evaluation method, demonstrates that MPIrigen generates accurate MPI functions, with up to 0.8 accuracy in location and function predictions and more than 0.9 accuracy in argument predictions. The success of this tailored solution underscores the importance of domain-specific fine-tuning in optimizing language models for parallel computing code generation, paving the way for a new generation of automatic parallelization tools. The sources of this work are available at our GitHub MPIrigen repository: https://github.com/Scientific-Computing-Lab-NRCN/MPI-rigen
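The abstract does not spell out how the "completion only after observing the whole code" preprocessing is implemented. One plausible reading, sketched below purely as an illustration, is to strip the MPI calls out of a program and append them, tagged with their original line numbers, after the remaining code, so that a left-to-right model sees the full program before it has to predict any MPI call. The regular expression, the separator string, and the function names `split_mpi_calls` and `build_example` are assumptions made for this sketch and are not taken from the MPIrigen repository.

```python
import re

# Assumed heuristic: a line "contains an MPI call" if it has MPI_<name>( in it.
MPI_CALL = re.compile(r"\bMPI_\w+\s*\(")


def split_mpi_calls(source: str):
    """Split C source into (code without MPI lines, removed MPI lines).

    The removed lines are returned as (original_line_number, text) pairs,
    with line numbers counted from 1.
    """
    kept = []
    removed = []
    for line_no, line in enumerate(source.splitlines(), start=1):
        if MPI_CALL.search(line):
            removed.append((line_no, line.strip()))
        else:
            kept.append(line)
    return "\n".join(kept), removed


def build_example(source: str, sep: str = "\n// MPI calls:\n") -> str:
    """Build one training string: stripped code first, MPI calls after it.

    The separator is a hypothetical marker chosen for this sketch.
    """
    stripped_code, removed = split_mpi_calls(source)
    target = "\n".join(f"{line_no}: {call}" for line_no, call in removed)
    return stripped_code + sep + target
```

Under this reading, each prediction carries a location (the line number), a function name, and its arguments, which lines up with the three quantities the abstract reports accuracy for. A real pipeline would likely need a proper C parser, since this line-based heuristic misses multi-line calls and would also match MPI names inside comments or strings.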

