それが最終回答ですか？テスト時のスケーリングが選択的質問応答を改善する

要旨

大規模言語モデルの推論時の計算リソースをスケーリングすることは、推論ベンチマークにおいて印象的な性能を示すことが実証されています。しかし、既存の推論時スケーリング評価では、推論システムが与えられたあらゆる質問に対して常に回答を提供すべきであるという強い仮定が置かれています。これでは、モデルが自身の回答に自信を持っているかどうか、また常に回答を提供することが適切かどうかといった懸念が見落とされています。これらの懸念に対処するため、我々は推論中に信頼度スコアを抽出し、モデルの回答を閾値処理します。推論時の計算予算を増やすことで、モデルがより多くの質問に正しく回答できるだけでなく、正しい回答に対する自信も高まることがわかりました。さらに、我々は評価時のゼロリスク回答という現在のパラダイムを拡張し、非ゼロの回答リスクレベルを考慮した設定を検討し、これらの設定下での評価報告のための方法論を提案します。

English

Scaling the test-time compute of large language models has demonstrated impressive performance on reasoning benchmarks. However, existing evaluations of test-time scaling make the strong assumption that a reasoning system should always give an answer to any question provided. This overlooks concerns about whether a model is confident in its answer, and whether it is appropriate to always provide a response. To address these concerns, we extract confidence scores during reasoning for thresholding model responses. We find that increasing compute budget at inference time not only helps models answer more questions correctly, but also increases confidence in correct responses. We then extend the current paradigm of zero-risk responses during evaluation by considering settings with non-zero levels of response risk, and suggest a recipe for reporting evaluations under these settings.

それが最終回答ですか？テスト時のスケーリングが選択的質問応答を改善する

Is That Your Final Answer? Test-Time Scaling Improves Selective Question Answering

要旨

Support