

PhD Knowledge Not Required: A Reasoning Challenge for Large Language Models

February 3, 2025
作者: Carolyn Jane Anderson, Joydeep Biswas, Aleksander Boruch-Gruszecki, Federico Cassano, Molly Q Feldman, Arjun Guha, Francesca Lucchetti, Zixuan Wu
cs.AI

Abstract

Existing benchmarks for frontier models often test specialized, "PhD-level" knowledge that is difficult for non-experts to grasp. In contrast, we present a benchmark based on the NPR Sunday Puzzle Challenge that requires only general knowledge. Our benchmark is challenging for both humans and models; however, correct solutions are easy to verify, and models' mistakes are easy to spot. Our work reveals capability gaps that are not evident in existing benchmarks: OpenAI o1 significantly outperforms other reasoning models that perform on par with it on benchmarks that test specialized knowledge. Furthermore, our analysis of reasoning outputs uncovers new kinds of failures. DeepSeek R1, for instance, often concedes with "I give up" before providing an answer that it knows is wrong. R1 can also be remarkably "uncertain" in its output, and in rare cases it does not "finish thinking," which suggests the need for an inference-time technique to "wrap up" before the context window limit is reached. We also quantify the effectiveness of reasoning longer with R1 and Gemini Thinking to identify the point beyond which more reasoning is unlikely to improve accuracy on our benchmark.
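The abstract emphasizes that, unlike open-ended tasks, the benchmark's answers are short and easy to verify. A minimal sketch of that kind of check (the `is_correct` helper and normalization rules here are illustrative assumptions, not the paper's actual grading code) is string comparison after normalizing case, punctuation, and whitespace:

```python
import re

def normalize(answer: str) -> str:
    """Lowercase the answer, turn punctuation into spaces, and collapse whitespace."""
    answer = answer.lower()
    answer = re.sub(r"[^\w\s]", " ", answer)  # replace punctuation with spaces
    return re.sub(r"\s+", " ", answer).strip()  # collapse runs of whitespace

def is_correct(model_answer: str, gold_answer: str) -> bool:
    """A model answer counts as correct if it exactly matches the gold answer
    after normalization."""
    return normalize(model_answer) == normalize(gold_answer)
```

For example, `is_correct("  Bumper-Sticker! ", "bumper sticker")` returns `True`, while a wrong answer like `"bumper crop"` does not match. Exact match after normalization works here precisely because puzzle answers are single words or short phrases, which is the property the abstract highlights.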
