LLPut: Investigating Large Language Models for Bug Report-Based Input Generation

Abstract

Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Since bug reports are written in natural language, prior research has leveraged various Natural Language Processing (NLP) techniques for automated input extraction. With the advent of Large Language Models (LLMs), an important research question arises: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique to empirically evaluate the performance of three open-source generative LLMs (LLaMA, Qwen, and Qwen-Coder) in extracting relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
