CompassVerifier: A Unified and Robust Verifier for LLMs Evaluation and Outcome Reward

Abstract

Answer verification is crucial not only for evaluating large language models (LLMs) by matching their unstructured outputs against standard answers, but also as the reward model that guides LLM optimization. Most evaluation frameworks rely on regularized matching or employ general LLMs for answer verification, which demands extensive, repetitive customization of regex rules or evaluation prompts. Two fundamental limitations persist in current methodologies: 1) the absence of comprehensive benchmarks that systematically evaluate verification capabilities across different LLMs; and 2) the nascent stage of verifier development, where existing approaches lack both the robustness to handle complex edge cases and the generalizability across different domains. In this work, we develop CompassVerifier, an accurate and robust lightweight verifier model for evaluation and outcome reward. It demonstrates multi-domain competency spanning math, knowledge, and diverse reasoning tasks, and it can process various answer types, including multi-subproblem, formula, and sequence answers, while effectively identifying abnormal or invalid responses. We also introduce the VerifierBench benchmark, which comprises model outputs collected from multiple data sources and is augmented through manual analysis of meta-error patterns to enhance CompassVerifier. We anticipate that CompassVerifier and VerifierBench will facilitate answer verification, evaluation protocols, and reinforcement learning research. The code and dataset are available at https://github.com/open-compass/CompassVerifier.
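
To make the motivation concrete, the following is a minimal sketch of the kind of rule-based matching that the abstract says needs repeated customization. It is not taken from CompassVerifier or its repository: the function names, the "Answer:" marker convention, and the normalization rules are all assumptions chosen for illustration.

```python
import re
from typing import Optional


def normalize(text: str) -> str:
    """Lowercase, drop all whitespace, and strip trailing periods."""
    text = text.strip().lower()
    text = re.sub(r"\s+", "", text)
    return text.rstrip(".")


def extract_final_answer(response: str) -> Optional[str]:
    """Hypothetical extraction rule: take the text after the last 'Answer:' marker."""
    matches = re.findall(r"answer\s*:\s*(.+)", response, flags=re.IGNORECASE)
    return matches[-1] if matches else None


def rule_based_verify(response: str, gold: str) -> bool:
    """Return True only if the extracted answer string equals the gold string."""
    predicted = extract_final_answer(response)
    if predicted is None:
        # No marker found: the rule cannot tell a wrong answer from an unparsed one.
        return False
    return normalize(predicted) == normalize(gold)


if __name__ == "__main__":
    print(rule_based_verify("Some reasoning. Answer: 42.", "42"))  # exact match after normalization
    print(rule_based_verify("So the answer: 0.5", "1/2"))          # equivalent value, different form
    print(rule_based_verify("I think it is 42", "42"))             # correct value, no marker
```

Under these assumed rules, only the first call is accepted. The second and third are rejected even though the responses are arguably correct, which is the sort of edge case (equivalent formulas, unexpected output formats) that pushes such rules toward case-by-case patching.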
