Learning to Generate Unit Tests for Automated Debugging
February 3, 2025
Authors: Archiki Prasad, Elias Stengel-Eskin, Justin Chih-Yao Chen, Zaid Khan, Mohit Bansal
cs.AI
Abstract
Unit tests (UTs) play an instrumental role in assessing code correctness as
well as providing feedback to a large language model (LLM) as it iteratively
debugs faulty code, motivating automated test generation. However, we uncover a
trade-off between generating unit test inputs that reveal errors when given
faulty code and correctly predicting the unit test output without access to the
gold solution. To address this trade-off, we propose UTGen, which teaches LLMs
to generate unit test inputs that reveal errors along with their correct
expected outputs based on task descriptions and candidate code. We integrate
UTGen into UTDebug, a robust debugging pipeline that uses generated tests to
help LLMs debug effectively. Since model-generated tests can provide noisy
signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen
via test-time compute to improve UT output prediction, and (ii) validates and
back-tracks edits based on multiple generated UTs to avoid overfitting. We show
that UTGen outperforms UT generation baselines by 7.59% based on a metric
measuring the presence of both error-revealing UT inputs and correct UT
outputs. When used with UTDebug, we find that feedback from UTGen's unit tests
improves pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and our own harder
debugging split of MBPP+ by over 3% and 12.35% (respectively) over other
LLM-based UT generation baselines.
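
The abstract notes that UTDebug scales UTGen via test-time compute to improve unit test output prediction. A natural reading of this is self-consistency-style sampling: query the model several times for the expected output and keep the majority answer. The sketch below assumes that form; `llm_sample`, the prompt wording, and `n_samples` are illustrative placeholders rather than the paper's actual interface.

```python
from collections import Counter

def predict_ut_output(llm_sample, task_desc, code, ut_input, n_samples=16):
    """Predict the expected output for a generated unit-test input by
    sampling the model several times and majority-voting over the answers.

    `llm_sample` stands in for any callable that queries the LLM with
    temperature > 0 and returns one predicted output string.
    """
    prompt = (
        f"Task: {task_desc}\n"
        f"Candidate code:\n{code}\n"
        f"What would a correct solution return for input {ut_input!r}?"
    )
    samples = [llm_sample(prompt) for _ in range(n_samples)]
    # Agreement across samples serves as a rough confidence signal for the
    # predicted expected output.
    output, votes = Counter(samples).most_common(1)[0]
    return output, votes / n_samples
```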
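Because model-generated tests can be wrong, the abstract also describes validating edits against multiple generated UTs and back-tracking to avoid overfitting. Below is a minimal sketch of that accept-or-revert loop; the helper names (`llm_edit`, `run_tests`) and the strict "accept only if more tests pass" rule are assumptions for illustration, not the paper's exact procedure.

```python
def ut_debug(code, unit_tests, llm_edit, run_tests, max_rounds=5):
    """Iterative debugging with validation and back-tracking.

    `unit_tests` is a list of (input, expected_output) pairs from a UT
    generator; `llm_edit` proposes a revised program given the current code
    and tests; `run_tests` returns how many of the pairs the code passes.
    An edit is accepted only if it strictly increases the number of passing
    tests; otherwise we back-track to the previous version, so a single
    noisy test cannot drag the program in the wrong direction.
    """
    best_code = code
    best_passed = run_tests(best_code, unit_tests)
    for _ in range(max_rounds):
        if best_passed == len(unit_tests):
            break  # all generated tests pass; stop editing
        candidate = llm_edit(best_code, unit_tests)
        passed = run_tests(candidate, unit_tests)
        if passed > best_passed:
            best_code, best_passed = candidate, passed  # accept the edit
        # else: discard the candidate (back-track) and try another round
    return best_code
```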