No Resources, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

Abstract

Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language from a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages that commercial tools like GitHub Copilot do not support. As a result, companies need to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several solutions for teaching LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning that exploit the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. This approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without incurring the computational cost of instruction fine-tuning.
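
The abstract names weight diff transfer without giving its arithmetic. One plausible reading, stated here as an assumption rather than as the authors' exact procedure, is that the parameter-wise difference between the instruction model and its base model is added to the base model after further pre-training: merged = adapted + (instruct - base). The sketch below illustrates this on toy weights stored as plain Python lists; the function name `transfer_weight_diff` and the parameter names are hypothetical, and a real setting would apply the same operation to model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def transfer_weight_diff(base: Weights, instruct: Weights, adapted: Weights) -> Weights:
    """Return adapted + (instruct - base), computed parameter by parameter.

    base:     original base model
    instruct: instruction-tuned model derived from `base`
    adapted:  `base` after further pre-training on the target language
    (Assumed formulation; the abstract does not specify the exact rule.)
    """
    merged: Weights = {}
    for name, adapted_values in adapted.items():
        base_values = base[name]
        instruct_values = instruct[name]
        if not (len(base_values) == len(instruct_values) == len(adapted_values)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # Add the instruction-tuning delta on top of the adapted weights.
        merged[name] = [
            a + (i - b)
            for a, i, b in zip(adapted_values, instruct_values, base_values)
        ]
    return merged


# Hypothetical toy checkpoints with identical parameter names and shapes.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
instruct = {"layer.weight": [1.5, 2.0], "layer.bias": [0.25]}
adapted = {"layer.weight": [1.0, 3.0], "layer.bias": [-1.0]}

merged = transfer_weight_diff(base, instruct, adapted)
print(merged)
```

Under this reading, the merge is a single pass over the parameters and requires no gradient computation, which is consistent with the abstract's claim that instruction fine-tuning costs can be avoided. It does assume that all three checkpoints share the same architecture and parameter names.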