RM-RF: Reward Model for Run-Free Unit Test Evaluation
January 19, 2026
Authors: Elena Bruches, Daniil Grebenkin, Mikhail Klementev, Vadim Alperovich, Roman Derunets, Dari Baturova, Georgy Mkrtchyan, Oleg Sedukhin, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
cs.AI
Abstract
We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts, from source and test code alone, three execution-derived signals: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release the dataset together with a methodology for comparative evaluation. We evaluate multiple model families and tuning regimes (zero-shot, full fine-tuning, and parameter-efficient fine-tuning via LoRA), achieving an average F1 of 0.69 across the three targets. Compared with conventional compile-and-run tooling, RM-RF offers substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
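The abstract frames RM-RF as a single model that maps a (focal code, test code) pair to three independent binary signals. The sketch below shows what such a run-free scoring interface could look like, assuming a standard multi-label sequence classifier; the checkpoint name `rm-rf-base`, the paired-input format, the context length, and the 0.5 decision thresholds are illustrative assumptions, not the paper's released artifacts.

```python
# Minimal sketch of a run-free reward query: one encoder forward pass over the
# focal file and candidate test predicts the three execution-derived signals
# (compiles/runs, coverage gain, mutation-kill gain) without any execution.
# All names below are hypothetical stand-ins, not the authors' released model.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

MODEL_NAME = "rm-rf-base"  # hypothetical checkpoint name
TARGETS = ["compiles_and_runs", "coverage_increases", "mutation_kill_improves"]

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME,
    num_labels=len(TARGETS),
    problem_type="multi_label_classification",  # three independent binary heads
)

def predict_signals(focal_code: str, test_code: str) -> dict[str, bool]:
    """Predict the three execution-derived signals from code alone."""
    inputs = tokenizer(
        focal_code,           # focal (source) file as the first segment
        test_code,            # candidate test addition as the second segment
        truncation=True,
        max_length=4096,      # assumed context budget
        return_tensors="pt",
    )
    with torch.no_grad():
        logits = model(**inputs).logits.squeeze(0)
    probs = torch.sigmoid(logits)  # independent probabilities per target
    return {name: bool(p >= 0.5) for name, p in zip(TARGETS, probs)}
```

Because a query like this is a single forward pass rather than a compile-execute-measure cycle, it illustrates why the abstract can claim substantially lower latency and infrastructure cost than execution-based evaluation in large-scale test generation or RL loops.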