No Resources, No Benchmarks? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

Abstract

Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language from a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages that commercial tools like GitHub Copilot do not support. As a result, companies need to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several solutions for teaching LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning that exploit the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction-tuned model. This approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruction-following model without incurring the computational cost of instruction fine-tuning.
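
The weight diff transfer step can be illustrated with a minimal sketch. The abstract does not give the exact formula, so this assumes the common formulation in which the element-wise difference between the instruction-tuned and base weights is added to the further pre-trained weights. The function name `transfer_instruction_diff`, the `scale` parameter, and the dictionary-of-lists checkpoint representation are hypothetical and chosen only for illustration.

```python
from typing import Dict, List

# Hypothetical checkpoint representation: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def transfer_instruction_diff(
    base: Weights,
    instruct: Weights,
    adapted: Weights,
    scale: float = 1.0,
) -> Weights:
    """Return adapted + scale * (instruct - base), parameter by parameter.

    base:     original base model weights
    instruct: instruction-tuned weights derived from the same base model
    adapted:  base model weights after further pre-training on the target language
    scale:    assumed knob (not from the paper) controlling the strength of the diff
    """
    if not (base.keys() == instruct.keys() == adapted.keys()):
        raise ValueError("all three checkpoints must share the same parameter names")

    merged: Weights = {}
    for name in base:
        b, i, a = base[name], instruct[name], adapted[name]
        if not (len(b) == len(i) == len(a)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # The diff (i - b) approximates what instruction tuning changed;
        # adding it to the adapted weights injects that change.
        merged[name] = [a_k + scale * (i_k - b_k) for b_k, i_k, a_k in zip(b, i, a)]
    return merged
```

Under this assumption, the merge needs only element-wise arithmetic over three checkpoints with identical architectures, which is consistent with the stated goal of avoiding the cost of instruction fine-tuning.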