THOUGHTTERMINATOR: Benchmarking, Calibrating, and Mitigating Overthinking in Reasoning Models

Abstract

Reasoning models have demonstrated impressive performance on difficult tasks that traditional language models struggle with. However, many are plagued by the problem of overthinking: generating large amounts of unnecessary tokens that do not improve accuracy on a question. We introduce approximate measures of problem-level difficulty and demonstrate that a clear relationship exists between problem difficulty and optimal token spend. We then evaluate how well calibrated a variety of reasoning models are in terms of efficiently allocating the optimal token count, and find that, in general, reasoning models are poorly calibrated, particularly on easy problems. To evaluate calibration on easy questions, we introduce DUMB500, a dataset of extremely easy math, reasoning, code, and task problems, and jointly evaluate reasoning models on these simple examples and on extremely difficult examples from existing frontier benchmarks in the same task domains. Finally, we introduce THOUGHTTERMINATOR, a training-free black-box decoding technique that significantly improves reasoning model calibration.
