Learning to Generate Unit Tests for Automated Debugging

Abstract

Unit tests (UTs) play an instrumental role both in assessing code correctness and in providing feedback to a large language model (LLM) as it iteratively debugs faulty code, which motivates automated test generation. However, we uncover a trade-off between generating unit test inputs that reveal errors in a given piece of faulty code and correctly predicting the unit test output without access to the gold solution. To address this trade-off, we propose UTGen, which teaches LLMs to generate unit test inputs that reveal errors, along with their correct expected outputs, based on task descriptions and candidate code. We integrate UTGen into UTDebug, a robust debugging pipeline that uses generated tests to help LLMs debug effectively. Since model-generated tests can provide noisy signals (e.g., from incorrectly predicted outputs), UTDebug (i) scales UTGen via test-time compute to improve UT output prediction, and (ii) validates and backtracks edits based on multiple generated UTs to avoid overfitting. We show that UTGen outperforms UT generation baselines by 7.59% on a metric measuring the presence of both error-revealing UT inputs and correct UT outputs. When used with UTDebug, feedback from UTGen's unit tests improves the pass@1 accuracy of Qwen-2.5 7B on HumanEvalFix and on our own harder debugging split of MBPP+ by over 3% and 12.35%, respectively, over other LLM-based UT generation baselines.
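The two UTDebug ideas named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say how test-time compute is spent, so majority voting over sampled output predictions is an assumption here, and the hand-written lambdas stand in for LLM-proposed edits. All names (`majority_vote`, `pass_rate`, `debug_with_backtracking`) are hypothetical.

```python
from collections import Counter
from typing import Callable, List, Tuple

UnitTest = Tuple[int, int]  # (input, expected output)


def majority_vote(sampled_outputs: List[int]) -> int:
    """Assumed test-time scaling: keep the most frequent sampled prediction."""
    return Counter(sampled_outputs).most_common(1)[0][0]


def pass_rate(code: Callable[[int], int], tests: List[UnitTest]) -> float:
    """Fraction of generated tests that the candidate code passes."""
    passed = 0
    for x, expected in tests:
        try:
            if code(x) == expected:
                passed += 1
        except Exception:
            pass  # a crash counts as a failed test
    return passed / len(tests)


def debug_with_backtracking(code, candidate_edits, tests):
    """Accept an edit only if it raises the pass rate over ALL generated
    tests; otherwise discard it (backtrack) and keep the previous code."""
    best, best_rate = code, pass_rate(code, tests)
    for edit in candidate_edits:
        rate = pass_rate(edit, tests)
        if rate > best_rate:
            best, best_rate = edit, rate
    return best, best_rate


# Toy task (hypothetical): return the absolute value of an integer.
buggy = lambda x: x  # wrong for negative inputs

# Hypothetical sampled output predictions per generated test input.
samples = {-3: [3, 3, -3], 0: [0, 0, 0], 5: [5, 5, 5]}
tests = [(x, majority_vote(outs)) for x, outs in samples.items()]

edit_a = lambda x: -x      # fixes the failing test but breaks x = 5
edit_b = lambda x: abs(x)  # correct fix

fixed, final_rate = debug_with_backtracking(buggy, [edit_a, edit_b], tests)
```

In this toy run the input `-3` is error-revealing because the buggy code returns a value that differs from the voted expected output. `edit_a` passes that one test but does not raise the overall pass rate, so it is discarded. That is the overfitting case that validating against multiple generated tests is meant to catch.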
