Parameter-Efficient Fine-Tuning of Large Language Models for Unit Test Generation: An Empirical Study

Abstract

The advent of large language models (LLMs), such as those behind GitHub Copilot, has significantly enhanced programmers' productivity, particularly in code generation. However, these models often struggle with real-world tasks without fine-tuning. As LLMs grow larger and more capable, fine-tuning them for specialized tasks becomes increasingly expensive. Parameter-efficient fine-tuning (PEFT) methods, which fine-tune only a subset of model parameters, offer a promising solution by reducing the computational cost of tuning LLMs while maintaining their performance. Existing studies have applied PEFT to LLMs for various code-related tasks and found that the effectiveness of PEFT techniques is task-dependent. The application of PEFT techniques to unit test generation remains underexplored: the state of the art is limited to using LLMs with full fine-tuning to generate unit tests. This paper investigates both full fine-tuning and various PEFT methods, including LoRA, (IA)^3, and prompt tuning, across different model architectures and sizes. We use well-established benchmark datasets to evaluate their effectiveness in unit test generation. Our findings show that PEFT methods can deliver performance comparable to full fine-tuning for unit test generation, making specialized fine-tuning more accessible and cost-effective. Notably, prompt tuning is the most effective in terms of cost and resource utilization, while LoRA approaches the effectiveness of full fine-tuning in several cases.

