Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering
February 19, 2025
Authors: William Jurayj, Jeffrey Cheng, Benjamin Van Durme
cs.AI
Abstract
Scaling the test-time compute of large language models has demonstrated
impressive performance on reasoning benchmarks. However, existing evaluations
of test-time scaling make the strong assumption that a reasoning system should
always give an answer to any question provided. This overlooks concerns about
whether a model is confident in its answer, and whether it is appropriate to
always provide a response. To address these concerns, we extract confidence
scores during reasoning for thresholding model responses. We find that
increasing compute budget at inference time not only helps models answer more
questions correctly, but also increases confidence in correct responses. We
then extend the current paradigm of zero-risk responses during evaluation by
considering settings with non-zero levels of response risk, and suggest a
recipe for reporting evaluations under these settings.
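
The idea of thresholding responses by confidence and scoring them under different levels of response risk can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the example confidences, and the scoring rule (1 point for a correct answer, 0 for abstaining, a configurable penalty for a wrong answer) are all assumptions made here for illustration and are not specified in the abstract.

```python
from typing import Sequence


def coverage(confidences: Sequence[float], threshold: float) -> float:
    """Fraction of questions the system chooses to answer."""
    answered = sum(1 for c in confidences if c >= threshold)
    return answered / len(confidences)


def selective_score(
    confidences: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
    wrong_penalty: float = 0.0,
) -> float:
    """Average score per question under a hypothetical risk setting.

    Assumed scoring rule (not taken from the abstract):
      answered and correct   -> +1
      answered and incorrect -> -wrong_penalty
      abstained              ->  0
    With wrong_penalty = 0, answering is risk-free, so abstaining never helps.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    total = 0.0
    for conf, ok in zip(confidences, correct):
        if conf < threshold:
            continue  # abstain: contributes 0
        total += 1.0 if ok else -wrong_penalty
    return total / len(confidences)


# Hypothetical confidences and correctness labels for four questions.
confidences = [0.95, 0.80, 0.60, 0.30]
correct = [True, True, False, False]

always_answer_no_risk = selective_score(confidences, correct, threshold=0.0)
always_answer_risky = selective_score(confidences, correct, threshold=0.0, wrong_penalty=1.0)
selective_risky = selective_score(confidences, correct, threshold=0.7, wrong_penalty=1.0)
```

In this toy data, always answering scores 0.5 when wrong answers cost nothing but drops to 0.0 once each wrong answer costs one point, while abstaining below a confidence of 0.7 restores the score to 0.5 at a coverage of 0.5. Reporting scores at several penalty levels, rather than only the zero-penalty case, is one way to read the non-zero response risk settings the abstract describes.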