Improving the detection of technical debt in Java source code with an enriched dataset

Abstract

Technical debt (TD) describes the additional work and cost incurred when developers opt for a quick and easy solution to a problem rather than a more effective, well-designed, but time-consuming approach. Self-admitted technical debt (SATD) is a specific type of technical debt that developers intentionally document and acknowledge, typically in textual comments. While these self-admitted comments are useful for identifying technical debt, most existing approaches focus on capturing crucial tokens associated with various categories of TD and neglect the rich information embedded in the source code itself. Recent research has concentrated on detecting SATD by analyzing comments embedded in source code, and little work has addressed the technical debt contained in the source code itself. To fill this gap, we analyzed comments and their associated source code from 974 Java projects hosted in the Stack corpus and curated the first dataset of TD identified by code comments, coupled with the associated source code. Through an empirical evaluation, we found that the comments in the resulting dataset help enhance the prediction performance of state-of-the-art SATD detection models. More importantly, including the classified source code significantly improves the accuracy of predicting various types of technical debt. Our contribution is twofold: (i) we believe that our dataset will catalyze future work in the domain, inspiring various research questions related to the recognition of technical debt; (ii) the proposed classifiers may serve as baselines for other studies on TD detection that use the curated dataset.
