RM-RF: Reward Model for Run-Free Unit Test Evaluation

Abstract

We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the associated dataset and methodology for comparative evaluation. We test multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run approaches, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
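As an illustration of how a figure such as the reported average F1 could be computed, the following minimal sketch scores each of the three binary targets separately and then takes their unweighted mean. The averaging scheme is an assumption, since the abstract does not specify it, and the target names and helper functions are hypothetical labels introduced only for this example.

```python
from typing import Dict, Sequence

# Hypothetical names for the three execution-derived signals listed in the abstract.
TARGETS = (
    "compiles_and_runs",
    "increases_coverage",
    "improves_mutation_kill_rate",
)


def f1_score(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Binary F1 for the positive class, computed from raw counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denom = 2 * tp + fp + fn
    # With no positive labels and no positive predictions, F1 is undefined;
    # this sketch returns 0.0 in that case.
    return 2 * tp / denom if denom else 0.0


def average_f1(
    labels: Dict[str, Sequence[int]],
    preds: Dict[str, Sequence[int]],
) -> float:
    """Unweighted mean of per-target F1 (assumed averaging scheme)."""
    return sum(f1_score(labels[t], preds[t]) for t in TARGETS) / len(TARGETS)
```

Here `labels` would hold the outcomes recorded by an execution-based pipeline and `preds` the run-free predictions, each keyed by target name with one 0/1 entry per candidate test addition.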

