PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

Abstract

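Existing benchmarks for frontier models often test specialized, ``PhD-level'' knowledge that is hard for non-experts to understand. In contrast, we propose a benchmark based on the NPR Sunday Puzzle Challenge. It requires only general knowledge and is challenging for both humans and models, yet correct solutions are easy to verify and model mistakes are easy to spot. Our study reveals capability gaps that existing benchmarks do not expose: for example, OpenAI o1 significantly outperforms other reasoning models that perform on par with it on benchmarks that test specialized knowledge. Analysis of the reasoning outputs also reveals new kinds of failures. DeepSeek R1, for example, often says ``I give up'' and then gives an answer that it knows is wrong. R1 is also remarkably ``uncertain'' in its output, and in rare cases it does not ``finish thinking,'' which suggests the need for an inference-time technique to ``wrap up'' before the context window limit is reached. We also quantify the effect of longer reasoning with R1 and Gemini Thinking and identify the point beyond which further reasoning is unlikely to improve accuracy on the benchmark.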

English

Existing benchmarks for frontier models often test specialized, ``PhD-level'' knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models, but correct solutions are easy to verify and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that are on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with ``I give up'' before providing an answer that it knows is wrong. R1 can also be remarkably ``uncertain'' in its output and, in rare cases, does not ``finish thinking,'' which suggests the need for an inference-time technique to ``wrap up'' before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.