LLPut: Investigating Large Language Models for Bug Report-Based Input Generation

Abstract

Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Because bug reports are written in natural language, prior research has used various Natural Language Processing (NLP) techniques to automate input extraction. The advent of Large Language Models (LLMs) raises an important research question: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs (LLaMA, Qwen, and Qwen-Coder) in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
