LLPut: Investigating Large Language Models for Bug Report-Based Input Generation
March 26, 2025
Authors: Alif Al Hasan, Subarna Saha, Mia Mohammad Imran, Tarannum Shaila Zaman
cs.AI
Abstract
Failure-inducing inputs play a crucial role in diagnosing and analyzing
software bugs. Bug reports typically contain these inputs, which developers
extract to facilitate debugging. Since bug reports are written in natural
language, prior research has leveraged various Natural Language Processing
(NLP) techniques for automated input extraction. With the advent of Large
Language Models (LLMs), an important research question arises: how effectively
can generative LLMs extract failure-inducing inputs from bug reports? In this
paper, we propose LLPut, a technique to empirically evaluate the performance of
three open-source generative LLMs -- LLaMA, Qwen, and Qwen-Coder -- in
extracting relevant inputs from bug reports. We conduct an experimental
evaluation on a dataset of 206 bug reports to assess the accuracy and
effectiveness of these models. Our findings provide insights into the
capabilities and limitations of generative LLMs in automated bug diagnosis.