Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
March 3, 2025
Authors: Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo
cs.AI
Abstract
Recent advancements in generative AI have led to the widespread adoption of
large language models (LLMs) in software engineering, addressing numerous
long-standing challenges. However, a comprehensive study examining the
capabilities of LLMs in software vulnerability detection (SVD), a crucial
aspect of software security, is currently lacking. Existing research primarily
focuses on evaluating LLMs using C/C++ datasets and typically explores only one
or two of the three strategies of prompt engineering, instruction tuning, and
sequence classification fine-tuning, usually for open-source LLMs.
Consequently, there is a significant knowledge gap regarding the effectiveness
of diverse LLMs in detecting vulnerabilities across various programming
languages. To address this gap, we present a comprehensive empirical study
evaluating the performance of LLMs on the SVD task. We compiled a dataset
comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in
JavaScript. We assess five open-source LLMs using multiple approaches,
including prompt engineering, instruction tuning, and sequence classification
fine-tuning. These LLMs are benchmarked against five fine-tuned small language
models and two open-source static application security testing tools.
Furthermore, we explore two avenues to improve LLM performance on SVD: a) a
data perspective, retraining models on downsampled, balanced datasets; and b) a
model perspective, investigating ensemble learning methods that combine
predictions from multiple LLMs (both are sketched below). Our comprehensive
experiments demonstrate that SVD remains
a challenging task for LLMs. This study provides a thorough understanding of
the role of LLMs in SVD and offers practical insights for future advancements
in leveraging generative AI to enhance software security practices.
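
To make the prompt-engineering setting concrete, here is a minimal zero-shot sketch for binary vulnerability detection. The template wording, `build_prompt`, and `parse_label` are illustrative assumptions for this page, not the prompts or evaluation harness used in the study.

```python
# Hypothetical zero-shot prompt for binary SVD; the wording below is an
# assumption, not the study's actual prompt.
PROMPT_TEMPLATE = """You are a security expert. Decide whether the following
{language} function contains a security vulnerability.
Answer with exactly one word: YES or NO.

Function:
{code}

Answer:"""


def build_prompt(code: str, language: str = "Python") -> str:
    """Fill the template with one target function to classify."""
    return PROMPT_TEMPLATE.format(language=language, code=code)


def parse_label(reply: str) -> int:
    """Map the model's free-text reply to a binary label (1 = vulnerable)."""
    return 1 if "YES" in reply.strip().upper() else 0


if __name__ == "__main__":
    snippet = "def run(cmd):\n    import os\n    os.system(cmd)  # command injection risk"
    print(build_prompt(snippet))
    print(parse_label("Yes, this function is vulnerable."))  # -> 1
```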
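On the data perspective, the following sketch shows one plausible way to downsample the majority (non-vulnerable) class to obtain a balanced retraining set, assuming binary labels with 1 marking a vulnerable function; `downsample_balanced` is a hypothetical helper, not the paper's implementation.

```python
import random


def downsample_balanced(functions, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size.

    functions: list of code strings; labels: parallel list of 0/1 ints,
    where 1 marks a vulnerable function. Returns a shuffled balanced subset.
    """
    rng = random.Random(seed)  # fixed seed keeps the retraining set reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    keep = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(keep)
    return [functions[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    X = [f"func_{i}" for i in range(10)]
    y = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # 3 vulnerable, 7 non-vulnerable
    Xb, yb = downsample_balanced(X, y)
    print(len(yb), sum(yb))  # -> 6 3 (balanced: 3 of each class)
```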
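On the model perspective, a simple instance of ensemble learning is majority voting over per-model predictions. The sketch below assumes binary labels and works best with an odd number of models, since ties otherwise resolve to the first label seen; `majority_vote` is an illustrative name, not the specific ensemble the paper evaluates.

```python
from collections import Counter


def majority_vote(per_model_preds):
    """Combine binary predictions from several models by majority vote.

    per_model_preds: a list of equal-length prediction lists, one per model.
    With an even number of models, ties resolve to the first label seen.
    """
    ensembled = []
    for votes in zip(*per_model_preds):  # one column of votes per test sample
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled


if __name__ == "__main__":
    preds = [
        [1, 0, 1, 0],  # model A
        [1, 1, 0, 0],  # model B
        [0, 0, 1, 0],  # model C
    ]
    print(majority_vote(preds))  # -> [1, 0, 1, 0]
```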