Foundational Automatic Evaluators: Scaling Multi-Task Generative Evaluator Training for Reasoning-Centric Domains

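Abstract

Finetuning specialized generative evaluators has emerged as a popular paradigm to meet the increasing demand for scalable evaluation during both training and test time. However, recent work has largely focused on applying new methodology, such as reinforcement learning (RL), to training evaluators, shying away from large-scale, data-driven development. In this work, we focus on data scaling, curating a set of 2.5M samples spanning five unique evaluation tasks (pairwise, step-level, reference-free and reference-based verification, and single rating) and multiple domains focused on reasoning evaluation. With our data, we train Foundational Automatic Reasoning Evaluators (FARE), a family of 8B and 20B (with 3.6B active) parameter evaluators, with a simple iterative rejection-sampling supervised finetuning (SFT) approach. FARE-8B challenges larger specialized RL-trained evaluators, and FARE-20B sets the new standard for open-source evaluators, surpassing specialized 70B+ evaluators. Beyond static benchmarks, we evaluate FARE in real-world tasks. As an inference-time reranker, FARE-20B achieves near-oracle performance on MATH. As verifiers in RL training, FARE improves downstream RL-trained model performance by up to 14.1% vs. string-matching verifiers. When initialized from FARE, a continually finetuned FARE-Code outperforms gpt-oss-20B by 65% on evaluating test-case quality.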
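The abstract names iterative rejection-sampling SFT but does not spell out the procedure. The following is a minimal sketch of a single filtering round under stated assumptions, not the authors' implementation: each example carries a gold verdict, the current evaluator is sampled up to `k` times, and only a judgment whose verdict matches the gold label is kept as an SFT target. The names `Example`, `Judge`, and `rejection_sample`, the data schema, and the "keep the first matching sample" rule are all hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical schema (an assumption, not from the paper):
# each example is {"prompt": <evaluation prompt>, "label": <gold verdict>}.
Example = Dict[str, str]

# A judge maps a prompt to (reasoning, verdict). In practice this would be
# a sampled generation from the current evaluator model.
Judge = Callable[[str], Tuple[str, str]]


def rejection_sample(
    examples: List[Example], judge: Judge, k: int = 4
) -> List[Dict[str, str]]:
    """Run one rejection-sampling round.

    For each example, draw up to k judgments and keep the first one whose
    verdict equals the gold label. Examples with no matching judgment are
    dropped for this round.
    """
    kept: List[Dict[str, str]] = []
    for ex in examples:
        for _ in range(k):
            reasoning, verdict = judge(ex["prompt"])
            if verdict == ex["label"]:
                kept.append(
                    {
                        "prompt": ex["prompt"],
                        "target": f"{reasoning}\nVerdict: {verdict}",
                    }
                )
                break  # one accepted trace per example in this sketch
    return kept
```

In an iterative setup, the kept pairs would be used to finetune the evaluator, and the finetuned model would then act as `judge` in the next round. How many rounds to run and how samples are mixed across the five tasks are not specified in the abstract.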

