No Resources, No Benchmarks, No Problem? Evaluating and Improving Large Language Models' Code Generation for No-Resource Languages

Abstract

Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language from a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages that commercial tools like GitHub Copilot do not support, so companies need to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several ways to teach LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning that exploit the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. This approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without incurring the computational cost of instruction fine-tuning.
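
To make the weight diff transfer step concrete, the following is a minimal sketch, not the authors' implementation. It assumes the three checkpoints (the original base model, its instruction-tuned counterpart, and the base model after further pre-training on the target language) share the same parameter names and shapes, and that the merged weights are computed as adapted + scale × (instruct − base). The `transfer_instruction_diff` helper, the `scale` parameter, and the representation of checkpoints as dictionaries of flat float lists are all hypothetical choices made for illustration.

```python
from typing import Dict, List

# Hypothetical checkpoint representation: parameter name -> flattened values.
Weights = Dict[str, List[float]]


def transfer_instruction_diff(
    base: Weights,
    instruct: Weights,
    adapted: Weights,
    scale: float = 1.0,
) -> Weights:
    """Return adapted + scale * (instruct - base), parameter by parameter.

    base:     original base model weights
    instruct: instruction-tuned model derived from the same base
    adapted:  base model after further pre-training on the target language
    scale:    assumed knob for how much of the instruction diff to apply
    """
    if not (base.keys() == instruct.keys() == adapted.keys()):
        raise ValueError("checkpoints must share the same parameter names")

    merged: Weights = {}
    for name in base:
        b, i, a = base[name], instruct[name], adapted[name]
        if not (len(b) == len(i) == len(a)):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        # The diff (instruct - base) is taken to encode instruction following;
        # adding it to the adapted weights is the assumed transfer step.
        merged[name] = [av + scale * (iv - bv) for bv, iv, av in zip(b, i, a)]
    return merged


if __name__ == "__main__":
    base = {"w": [1.0, 2.0]}
    instruct = {"w": [1.5, 2.0]}
    adapted = {"w": [3.0, 4.0]}
    print(transfer_instruction_diff(base, instruct, adapted))
```

Under these assumptions no gradient computation is needed for the transfer itself: the only training cost is the further pre-training of the base model, which is consistent with the claim that instruction fine-tuning can be avoided.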