
Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

November 4, 2024
Authors: André Storhaug, Jingyue Li
cs.AI

Abstract

The advent of large language models (LLMs) like GitHub Copilot has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more performant, fine-tuning for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational costs of tuning LLMs while maintaining their performance. Existing studies have explored using PEFT and LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques in unit test generation remains underexplored. The state-of-the-art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.
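
To make the parameter-efficiency contrast concrete, the sketch below configures two of the PEFT methods the paper studies (LoRA and prompt tuning) with the Hugging Face `peft` library. This is a minimal sketch, not the paper's experimental setup: the base model, the `target_modules` name, and all hyperparameters are illustrative assumptions.

```python
# Minimal sketch of two PEFT setups using the Hugging Face `peft` library.
# The base model, target module names, and all hyperparameters are
# illustrative assumptions, not the paper's experimental configuration.
from transformers import AutoModelForCausalLM
from peft import (
    LoraConfig,
    PromptTuningConfig,
    PromptTuningInit,
    get_peft_model,
)

model_name = "Salesforce/codegen-350M-mono"  # assumed small code LLM
base_model = AutoModelForCausalLM.from_pretrained(model_name)

# LoRA: freeze the base weights and train only low-rank adapter matrices
# injected into the attention projections.
lora_config = LoraConfig(
    r=8,                          # adapter rank
    lora_alpha=16,                # adapter scaling factor
    target_modules=["qkv_proj"],  # assumed projection name for this model
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
lora_model = get_peft_model(base_model, lora_config)
lora_model.print_trainable_parameters()  # typically well under 1% trainable

# Prompt tuning: train only a few virtual token embeddings prepended to
# every input; apply with get_peft_model() on a fresh copy of the model.
prompt_config = PromptTuningConfig(
    task_type="CAUSAL_LM",
    num_virtual_tokens=20,
    prompt_tuning_init=PromptTuningInit.TEXT,
    prompt_tuning_init_text="Generate a unit test for the following method:",
    tokenizer_name_or_path=model_name,
)
# (IA)^3 has an analogous `IA3Config` in the same library.
```

Either configuration can then be trained with a standard fine-tuning loop (e.g. `transformers.Trainer`); the trainable-parameter count printed above is what makes these methods so much cheaper than updating all model weights.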

