Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains
October 20, 2025
Authors: Austin Xu, Xuan-Phi Nguyen, Yilun Zhou, Chien-Sheng Wu, Caiming Xiong, Shafiq Joty
cs.AI
Abstract
Finetuning specialized generative evaluators has emerged as a popular
paradigm to meet the increasing demand for scalable evaluation at both
training and test time. However, recent work has largely focused on applying
new methodology, such as reinforcement learning (RL), to training evaluators,
shying away from large-scale, data-driven development. In this work, we focus
on data scaling, curating a set of 2.5M samples spanning five unique evaluation
tasks (pairwise, step-level, reference-free and reference-based verification,
and single rating) and multiple domains focused on reasoning evaluation. With
our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family
of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative
rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges
larger specialized RL-trained evaluators and FARE-20B sets the new standard for
open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static
benchmarks, we evaluate FARE in real-world tasks: as an inference-time
reranker, FARE-20B achieves near-oracle performance on MATH. As a verifier in
RL training, FARE improves downstream RL-trained model performance by up to
14.1% over string-matching verifiers. When initialized from FARE, a
continually finetuned FARE-Code outperforms gpt-oss-20B by 65% at evaluating
test-case quality.
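The inference-time reranking use described above amounts to best-of-N selection: the evaluator scores each candidate solution and the highest-scoring one is returned. A minimal sketch, where `evaluator_score` is a hypothetical stand-in for a call to a FARE-style generative evaluator (not the paper's actual interface):

```python
def rerank_best_of_n(candidates, evaluator_score):
    """Best-of-N reranking: return the candidate the evaluator scores highest.

    `evaluator_score` is a placeholder for a generative evaluator's
    single-rating judgment (e.g., a model prompted to grade a solution).
    """
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=evaluator_score)


# Toy usage: prefer the candidate whose final answer matches the reference.
solutions = ["... so the answer is 5", "... so the answer is 4"]
toy_score = lambda s: 1.0 if s.endswith("4") else 0.0
best = rerank_best_of_n(solutions, toy_score)
```

With a near-perfect evaluator, this procedure approaches the oracle (always picking a correct candidate when one exists), which is the sense in which the abstract reports "near-oracle" reranking on MATH.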
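The iterative rejection-sampling SFT recipe can be illustrated with one hypothetical data-curation round: sample several responses per prompt, keep only those a verifier accepts, and finetune on the kept pairs before repeating. This is a simplified sketch, not the paper's pipeline; `generate` and `verify` are placeholder callables:

```python
def rejection_sampling_round(generate, verify, prompts, k=4):
    """One round of rejection-sampling data curation (simplified sketch).

    For each prompt, sample k candidate responses and keep only those the
    verifier accepts; the kept (prompt, response) pairs become SFT data
    for the next finetuning round.
    """
    kept = []
    for prompt in prompts:
        for _ in range(k):
            response = generate(prompt)
            if verify(prompt, response):
                kept.append((prompt, response))
    return kept


# Toy usage: a noisy "model" that is sometimes off by one, and an
# exact-answer verifier that filters its samples.
import random

random.seed(0)
prompts = ["2+2", "3+3"]
gen = lambda p: str(eval(p) + random.choice([0, 1]))
ver = lambda p, r: r == str(eval(p))
sft_data = rejection_sampling_round(gen, ver, prompts, k=8)
```

Iterating this loop (finetune, resample, refilter) is the "simple iterative rejection-sampling SFT" referred to in the abstract.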