Benchmarking Large Language Models for Multi-Language Software Vulnerability Detection

Abstract

Recent advances in generative AI have led to the widespread adoption of large language models (LLMs) in software engineering, where they address numerous long-standing challenges. However, a comprehensive study of the capabilities of LLMs in software vulnerability detection (SVD), a crucial aspect of software security, is still lacking. Existing research primarily evaluates LLMs on C/C++ datasets and typically explores only one or two strategies among prompt engineering, instruction tuning, and sequence classification fine-tuning for open-source LLMs. Consequently, there is a significant knowledge gap regarding how effectively diverse LLMs detect vulnerabilities across programming languages. To address this gap, we present an empirical study evaluating the performance of LLMs on the SVD task. We compile a dataset comprising 8,260 vulnerable functions in Python, 7,505 in Java, and 28,983 in JavaScript. We assess five open-source LLMs using multiple approaches, including prompt engineering, instruction tuning, and sequence classification fine-tuning. These LLMs are benchmarked against five fine-tuned small language models and two open-source static application security testing tools. Furthermore, we explore two avenues to improve LLM performance on SVD: a) Data perspective: retraining models on downsampled, balanced datasets. b) Model perspective: investigating ensemble learning methods that combine predictions from multiple LLMs. Our experiments demonstrate that SVD remains a challenging task for LLMs. This study provides a thorough understanding of the role of LLMs in SVD and offers practical insights for future advances in leveraging generative AI to enhance software security practices.
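The two improvement avenues named in the abstract can be illustrated with a minimal Python sketch. It is not the study's implementation: the abstract does not say how downsampling is performed or which ensemble method is used, so random majority-class downsampling and simple majority voting are assumed here, and the names `downsample_balanced` and `majority_vote` are hypothetical.

```python
import random
from collections import Counter


def downsample_balanced(samples, seed=0):
    """Randomly downsample so both classes have the same number of samples.

    `samples` is a list of (function_source, label) pairs, where label 1
    means vulnerable and 0 means non-vulnerable (an assumed encoding).
    """
    rng = random.Random(seed)
    by_label = {0: [], 1: []}
    for sample in samples:
        by_label[sample[1]].append(sample)
    # Keep as many samples per class as the smaller class contains.
    n = min(len(by_label[0]), len(by_label[1]))
    balanced = rng.sample(by_label[0], n) + rng.sample(by_label[1], n)
    rng.shuffle(balanced)
    return balanced


def majority_vote(predictions):
    """Combine binary predictions from several models for one function.

    Ties resolve to 1 (vulnerable); this tie-break is an arbitrary
    choice made for the sketch.
    """
    votes = Counter(predictions)
    return 1 if votes[1] >= votes[0] else 0


if __name__ == "__main__":
    # Toy data: 8 non-vulnerable and 2 vulnerable placeholder functions.
    data = [(f"def safe_{i}(): pass", 0) for i in range(8)]
    data += [(f"def risky_{i}(): pass", 1) for i in range(2)]
    balanced = downsample_balanced(data)
    print(len(balanced), "samples after downsampling")
    print("ensemble prediction:", majority_vote([1, 0, 1]))
```

Downsampling discards majority-class data in exchange for a balanced label distribution, and majority voting treats every model equally; weighted or learned combinations are also possible but are not shown here.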
