Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Abstract

Test-time scaling increases inference-time computation by allowing models to generate long reasoning chains, and it has shown strong performance across many domains. In this work, however, we show that the approach is not yet effective for knowledge-intensive tasks, where high factual accuracy and low hallucination rates are essential. We conduct a comprehensive evaluation of test-time scaling using 12 reasoning models on two knowledge-intensive benchmarks. Our results reveal that increasing test-time computation does not consistently improve accuracy and, in many cases, even leads to more hallucinations. We then analyze how extended reasoning affects hallucination behavior. We find that reduced hallucinations often result from the model choosing to abstain after thinking more, rather than from improved factual recall. Conversely, for some models, longer reasoning encourages attempts on previously unanswered questions, many of which result in hallucinations. Case studies show that extended reasoning can induce confirmation bias, leading to overconfident hallucinations. Despite these limitations, we observe that enabling thinking remains beneficial compared to non-thinking. Code and data are available at https://github.com/XuZhao0/tts-knowledge
