Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
October 20, 2025
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Finetuning specialized generative evaluators has emerged as a popular
paradigm to meet the increasing demand for scalable evaluation at both
training and test time. However, recent work has largely focused on applying
new methodologies, such as reinforcement learning (RL), to evaluator training,
shying away from large-scale, data-driven development. In this work, we focus
on data scaling, curating a set of 2.5M samples spanning five unique evaluation
tasks (pairwise, step-level, reference-free and reference-based verification,
and single rating) and multiple domains focused on reasoning evaluation. With
our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family
of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative
rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges
larger specialized RL-trained evaluators and FARE-20B sets the new standard for
open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static
benchmarks, we evaluate FARE on real-world tasks: as an inference-time reranker,
FARE-20B achieves near-oracle performance on MATH; as a verifier in RL training,
FARE improves downstream RL-trained model performance by up to 14.1% over
string-matching verifiers. When initialized from FARE, a continually finetuned
FARE-Code outperforms gpt-oss-20B by 65% at evaluating test-case quality.
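The iterative rejection-sampling SFT recipe named above can be pictured as a filter loop: sample candidate judgments from the current evaluator, keep only those that agree with the gold verdict, and use the kept traces as the next round's SFT data. Below is a minimal sketch of one filtering round under that reading, with a toy sampler standing in for the evaluator model; `sample_judgments` and `rejection_sample_round` are illustrative names, not code from the paper.

```python
# Sketch of one rejection-sampling round for evaluator SFT data curation.
# Assumption: each training example carries a gold pairwise verdict ("A"/"B")
# that sampled judgments are filtered against.
import random

def sample_judgments(prompt, k, seed):
    # Stand-in for sampling k candidate judgments from the current evaluator.
    rng = random.Random(seed)
    return [rng.choice(["A", "B"]) for _ in range(k)]

def rejection_sample_round(dataset, k=4, seed=0):
    """Keep (prompt, judgment) pairs whose judgment matches the gold label;
    the kept traces form the SFT set for the next training iteration."""
    kept = []
    for i, (prompt, gold) in enumerate(dataset):
        for judgment in sample_judgments(prompt, k, seed + i):
            if judgment == gold:
                kept.append((prompt, judgment))
                break  # one correct trace per example is enough
    return kept

data = [("Which response is better?", "A"), ("Compare step 3 of each proof.", "B")]
sft_set = rejection_sample_round(data)
```

In practice the kept traces would be full chain-of-thought judgments rather than bare verdicts, and the filter-then-finetune loop repeats for several iterations.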