LLPut: Investigating Large Language Models for Bug Report-Based Input Generation

Abstract

Failure-inducing inputs play a crucial role in diagnosing and analyzing software bugs. Bug reports typically contain these inputs, which developers extract to facilitate debugging. Because bug reports are written in natural language, prior research has applied various Natural Language Processing (NLP) techniques to automate input extraction. The advent of Large Language Models (LLMs) raises an important research question: how effectively can generative LLMs extract failure-inducing inputs from bug reports? In this paper, we propose LLPut, a technique for empirically evaluating how well three open-source generative LLMs (LLaMA, Qwen, and Qwen-Coder) extract relevant inputs from bug reports. We conduct an experimental evaluation on a dataset of 206 bug reports to assess the accuracy and effectiveness of these models. Our findings provide insights into the capabilities and limitations of generative LLMs in automated bug diagnosis.
