Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and it has shown strong performance across many domains. However, in this work, we show that this approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that enabling thinking remains beneficial compared to non-thinking. Code and data are available at https://github.com/XuZhao0/tts-knowledge

