Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection
March 3, 2025
Authors: Ting Zhang, Chengran Yang, Yindu Su, Martin Weyssow, Hung Nguyen, Tan Bui, Hong Jin Kang, Yikun Li, Eng Lieh Ouh, Lwin Khin Shar, David Lo
cs.AI
Abstract
Recent advancements in generative AI have led to the widespread adoption of
large language models (LLMs) in software engineering, addressing numerous
long-standing challenges. However, a comprehensive study examining the
capabilities of LLMs in software vulnerability detection (SVD), a crucial
aspect of software security, is currently lacking. Existing research primarily
focuses on evaluating LLMs using C/C++ datasets and typically explores only one
or two of the three strategies of prompt engineering, instruction tuning, and
sequence classification fine-tuning, usually for open-source LLMs.
Consequently, there is a significant knowledge gap regarding the effectiveness
of diverse LLMs in detecting vulnerabilities across various programming
languages. To address this gap, we present a comprehensive empirical study
evaluating the performance of LLMs on the SVD task. We compiled a dataset
comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in
JavaScript. We assess five open-source LLMs using multiple approaches,
including prompt engineering, instruction tuning, and sequence classification
fine-tuning. These LLMs are benchmarked against five fine-tuned small language
models and two open-source static application security testing tools.
Furthermore, we explore two avenues to improve LLM performance on SVD: a) a
data perspective, retraining models on downsampled, balanced datasets; and b) a
model perspective, investigating ensemble learning methods that combine
predictions from multiple LLMs (both are sketched below). Our comprehensive
experiments demonstrate that SVD remains
a challenging task for LLMs. This study provides a thorough understanding of
the role of LLMs in SVD and offers practical insights for future advancements
in leveraging generative AI to enhance software security practices.
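
To make the prompt-engineering setting concrete, here is a minimal zero-shot sketch for binary vulnerability detection. The template wording, `build_prompt`, and `parse_label` are illustrative assumptions for this page, not the prompts or evaluation harness used in the study.

```python
# Hypothetical zero-shot prompt for binary SVD; the wording below is an
# assumption, not the study's actual prompt.
PROMPT_TEMPLATE = """You are a security expert. Decide whether the following
{language} function contains a security vulnerability.
Answer with exactly one word: YES or NO.

Function:
{code}

Answer:"""


def build_prompt(code: str, language: str = "Python") -> str:
    """Fill the template with one target function to classify."""
    return PROMPT_TEMPLATE.format(language=language, code=code)


def parse_label(reply: str) -> int:
    """Map the model's free-text reply to a binary label (1 = vulnerable)."""
    return 1 if "YES" in reply.strip().upper() else 0


if __name__ == "__main__":
    snippet = "def run(cmd):\n    import os\n    os.system(cmd)  # command injection risk"
    print(build_prompt(snippet))
    print(parse_label("Yes, this function is vulnerable."))  # -> 1
```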
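On the data perspective, the following sketch shows one plausible way to downsample the majority (non-vulnerable) class to obtain a balanced retraining set, assuming binary labels with 1 marking a vulnerable function; `downsample_balanced` is a hypothetical helper, not the paper's implementation.

```python
import random


def downsample_balanced(functions, labels, seed=0):
    """Randomly drop majority-class samples until both classes are equal in size.

    functions: list of code strings; labels: parallel list of 0/1 ints,
    where 1 marks a vulnerable function. Returns a shuffled balanced subset.
    """
    rng = random.Random(seed)  # fixed seed keeps the retraining set reproducible
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    k = min(len(pos), len(neg))
    keep = rng.sample(pos, k) + rng.sample(neg, k)
    rng.shuffle(keep)
    return [functions[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    X = [f"func_{i}" for i in range(10)]
    y = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0]  # 3 vulnerable, 7 non-vulnerable
    Xb, yb = downsample_balanced(X, y)
    print(len(yb), sum(yb))  # -> 6 3 (balanced: 3 of each class)
```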
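On the model perspective, a simple instance of ensemble learning is majority voting over per-model predictions. The sketch below assumes binary labels and works best with an odd number of models, since ties otherwise resolve to the first label seen; `majority_vote` is an illustrative name, not the specific ensemble the paper evaluates.

```python
from collections import Counter


def majority_vote(per_model_preds):
    """Combine binary predictions from several models by majority vote.

    per_model_preds: a list of equal-length prediction lists, one per model.
    With an even number of models, ties resolve to the first label seen.
    """
    ensembled = []
    for votes in zip(*per_model_preds):  # one column of votes per test sample
        ensembled.append(Counter(votes).most_common(1)[0][0])
    return ensembled


if __name__ == "__main__":
    preds = [
        [1, 0, 1, 0],  # model A
        [1, 1, 0, 0],  # model B
        [0, 0, 1, 0],  # model C
    ]
    print(majority_vote(preds))  # -> [1, 0, 1, 0]
```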