Improving the detection of technical debt in Java source code with an enriched dataset

Abstract

Technical debt (TD) refers to the additional work and cost incurred when developers choose a quick and easy solution to a problem over a more effective and better-designed, but more time-consuming, approach. Self-admitted technical debt (SATD) is a specific type of technical debt that developers intentionally document and acknowledge, typically in textual comments. While these self-admitted comments are useful for identifying technical debt, most existing approaches focus on capturing crucial tokens associated with various categories of TD and neglect the rich information embedded in the source code itself. Recent research has concentrated on detecting SATD by analyzing comments embedded in source code, and little work has addressed the technical debt contained in the source code itself. To fill this gap, we analyzed comments and their associated source code from 974 Java projects hosted in the Stack corpus and curated the first dataset of TD identified by code comments, coupled with the associated source code. An empirical evaluation shows that the comments in the resulting dataset help improve the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy of predicting various types of technical debt. Our contribution is twofold: (i) we believe that the dataset will catalyze future work in the domain, inspiring research on various issues related to the recognition of technical debt; (ii) the proposed classifiers may serve as baselines for other studies on TD detection that use the curated dataset.

