xVerify: Efficient Answer Verifier for Reasoning Model Evaluations
April 14, 2025
Authors: Ding Chen, Qingchen Yu, Pengyuan Wang, Wentao Zhang, Bo Tang, Feiyu Xiong, Xinchi Li, Minchuan Yang, Zhiyu Li
cs.AI
Abstract
With the release of the o1 model by OpenAI, reasoning models adopting slow
thinking strategies have gradually emerged. As the responses generated by such
models often include complex reasoning, intermediate steps, and
self-reflection, existing evaluation methods are often inadequate. They
struggle to determine whether the LLM output is truly equivalent to the
reference answer, and also have difficulty identifying and extracting the final
answer from long, complex responses. To address this issue, we propose xVerify,
an efficient answer verifier for reasoning model evaluations. xVerify
demonstrates strong capability in equivalence judgment, enabling it to
effectively determine whether the answers produced by reasoning models are
equivalent to reference answers across various types of objective questions. To
train and evaluate xVerify, we construct the VAR dataset by collecting
question-answer pairs generated by multiple LLMs across various datasets,
leveraging multiple reasoning models and challenging evaluation sets designed
specifically for reasoning model assessment. A multi-round annotation process
is employed to ensure label accuracy. Based on the VAR dataset, we train
multiple xVerify models of different scales. In evaluation experiments
conducted on both the test set and generalization set, all xVerify models
achieve overall F1 scores and accuracy exceeding 95%. Notably, the smallest
variant, xVerify-0.5B-I, outperforms all evaluation methods except GPT-4o,
while xVerify-3B-Ib surpasses GPT-4o in overall performance. These results
validate the effectiveness and generalizability of xVerify.
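For context, the overall F1 score and accuracy reported for the xVerify models are standard binary-classification metrics computed over the verifier's equivalence judgments (label 1 = the model's answer is equivalent to the reference, 0 = not equivalent). A minimal sketch of how such metrics are computed, using invented labels purely for illustration:

```python
def accuracy_and_f1(gold, pred):
    """Return (accuracy, F1) for binary equivalence labels (1 = equivalent, 0 = not)."""
    assert len(gold) == len(pred) and len(gold) > 0
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
    accuracy = sum(1 for g, p in zip(gold, pred) if g == p) / len(gold)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return accuracy, f1

# Hypothetical human labels vs. verifier judgments (not real xVerify data):
gold = [1, 1, 0, 1, 0, 1, 1, 0]
pred = [1, 1, 0, 1, 1, 1, 1, 0]
acc, f1 = accuracy_and_f1(gold, pred)
```

Here accuracy is the fraction of judgments matching the human labels, while F1 balances precision and recall on the "equivalent" class; both are computed over the test and generalization sets in the paper.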