No Resource, No Benchmarks, No Problem? Evaluating and Improving LLMs for Code Generation in No-Resource Languages

June 15, 2026
Authors: Alessandro Giagnorio, Alberto Martin-Lopez, Gabriele Bavota
cs.AI

Abstract

Large Language Models (LLMs) have significantly advanced the automation of software engineering tasks. One prominent example is code generation, where an LLM produces code in a specified programming language from a natural language description. Most research in this area has focused on high-resource languages, such as Python or Java, which benefit from abundant training data. A smaller body of work has explored low-resource languages, which are underrepresented in training corpora. In contrast, no-resource languages, for which LLMs have seen virtually no training data, remain largely unstudied. These languages often emerge in industry, where organizations develop proprietary or domain-specific languages that commercial tools like GitHub Copilot do not support, so companies need to deploy their own in-house code recommenders. To investigate possible solutions in this context, we build and release three code generation benchmarks for no-resource languages, based on two recently proposed programming languages for which very little training data is available. Using these benchmarks, we experiment with several ways to teach LLMs about no-resource languages, including prompt-based techniques as well as pre-training and fine-tuning on the little data available. While further pre-training gives the largest performance gains for no-resource languages, applying it directly to instruction-tuned models harms their ability to follow instructions. To address this, we start from a base model, further pre-train it on the target language, and then inject instruction-following capabilities via weight diff transfer from an instruction model. This approach significantly improves code generation capabilities in no-resource settings, allowing companies to cheaply deploy a specialized instruct model without incurring the computational cost of instruction fine-tuning.
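
The abstract does not spell out how the weight diff transfer is computed. One common reading, assumed here and not confirmed by the source, is that the per-parameter difference between an instruction-tuned model and its base model is added to the base model after it has been further pre-trained on the target language. The sketch below illustrates that arithmetic on toy weights stored as plain Python lists; the function name `transfer_weight_diff` and the parameter names are hypothetical, and a real setup would operate on tensors in a model state dict.

```python
from typing import Dict, List

# Toy stand-in for a model state dict: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def transfer_weight_diff(base: Weights, instruct: Weights, adapted: Weights) -> Weights:
    """Return adapted + (instruct - base), computed parameter by parameter.

    base     : weights of the original base model
    instruct : weights of the instruction-tuned model derived from `base`
    adapted  : weights of `base` after further pre-training on the target language
    (Assumed formulation; the abstract only names the technique.)
    """
    merged: Weights = {}
    for name, base_values in base.items():
        if name not in instruct or name not in adapted:
            raise KeyError(f"parameter {name!r} missing from one of the models")
        inst_values = instruct[name]
        adapt_values = adapted[name]
        if not (len(base_values) == len(inst_values) == len(adapt_values)):
            raise ValueError(f"size mismatch for parameter {name!r}")
        # The instruction "diff" is (instruct - base); add it to the adapted weights.
        merged[name] = [a + (i - b) for b, i, a in zip(base_values, inst_values, adapt_values)]
    return merged


# Hypothetical toy weights, chosen only to make the arithmetic easy to follow.
base = {"layer.weight": [1.0, 2.0], "layer.bias": [0.0]}
instruct = {"layer.weight": [1.5, 2.0], "layer.bias": [0.25]}
adapted = {"layer.weight": [1.0, 3.0], "layer.bias": [-0.5]}

merged = transfer_weight_diff(base, instruct, adapted)
print(merged)
```

Under this reading no gradient-based instruction tuning is run on the adapted model; the merge is a single pass of element-wise arithmetic over the parameters, which is consistent with the abstract's claim that the cost of instruction fine-tuning is avoided.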