RM-RF: Reward Model for Run-Free Unit Test Evaluation

Abstract

We present RM-RF, a lightweight reward model for run-free evaluation of automatically generated unit tests. Instead of repeatedly compiling and executing candidate tests, RM-RF predicts three execution-derived signals from source and test code alone: (1) whether the augmented test suite compiles and runs successfully, (2) whether the generated test cases increase code coverage, and (3) whether the generated test cases improve the mutation kill rate. To train and evaluate RM-RF, we assemble a multilingual dataset (Java, Python, Go) of focal files, test files, and candidate test additions labeled by an execution-based pipeline, and we release an associated dataset and methodology for comparative evaluation. We test multiple model families and tuning regimes (zero-shot, full fine-tuning, and PEFT via LoRA), achieving an average F1 of 0.69 across the three targets. Compared to conventional compile-and-run approaches, RM-RF provides substantially lower latency and infrastructure cost while delivering competitive predictive fidelity, enabling fast, scalable feedback for large-scale test generation and RL-based code optimization.
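To make the reported metric concrete, the following is a minimal sketch of how an average F1 over the three binary targets could be computed. It assumes an unweighted mean of per-target F1 scores with the positive class being "the signal holds"; the abstract does not state the averaging scheme, and the names `Labels`, `TARGETS`, `f1_score`, and `average_f1` are hypothetical, not taken from the released methodology.

```python
from dataclasses import dataclass

# Hypothetical names for the three execution-derived signals in the abstract.
TARGETS = ("compiles_and_runs", "increases_coverage", "improves_mutation_kill_rate")


@dataclass(frozen=True)
class Labels:
    """One candidate test addition: three binary signals (gold or predicted)."""
    compiles_and_runs: bool
    increases_coverage: bool
    improves_mutation_kill_rate: bool


def f1_score(gold, pred):
    """Binary F1 with True as the positive class; 0.0 when there are no true positives."""
    tp = sum(g and p for g, p in zip(gold, pred))
    fp = sum((not g) and p for g, p in zip(gold, pred))
    fn = sum(g and (not p) for g, p in zip(gold, pred))
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def average_f1(gold, pred):
    """Return (unweighted mean F1 over the three targets, per-target F1 dict).

    Assumption: targets are weighted equally; the source does not specify this.
    """
    scores = {
        t: f1_score([getattr(g, t) for g in gold], [getattr(p, t) for p in pred])
        for t in TARGETS
    }
    return sum(scores.values()) / len(TARGETS), scores


# Toy data: gold labels from an execution pipeline vs. run-free predictions.
gold = [
    Labels(True, True, False),
    Labels(True, False, False),
    Labels(False, False, True),
    Labels(True, True, True),
]
pred = [
    Labels(True, True, False),
    Labels(True, True, False),
    Labels(True, False, True),
    Labels(True, False, False),
]
mean_f1, per_target = average_f1(gold, pred)
print(per_target, mean_f1)
```

In this sketch the gold labels stand in for the output of the execution-based labeling pipeline, while the predictions stand in for what a run-free model would emit from source and test code alone.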
