THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Abstract

Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued by overthinking: generating large amounts of unnecessary tokens that do not improve accuracy on a question. We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship exists between problem difficulty and optimal token spend. We then evaluate how well calibrated a variety of reasoning models are at efficiently allocating the optimal token count, and find that reasoning models are in general poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions, we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and on extremely difficult examples from existing frontier benchmarks in the same task domain. Finally, we introduce THOUGHTTERMINATOR, a training-free, black-box decoding technique that significantly improves reasoning model calibration.

