TurtleBench: Evaluating Top Language Models via Real-World Yes/No Puzzles
October 7, 2024
Authors: Qingchen Yu, Shichao Song, Ke Fang, Yunfeng Shi, Zifan Zheng, Hanyu Wang, Simin Niu, Zhiyu Li
cs.AI
Abstract
As the application of Large Language Models (LLMs) expands, the demand for
reliable evaluations increases. Existing LLM evaluation benchmarks primarily
rely on static datasets, making it challenging to assess model performance in
dynamic interactions with users. Moreover, these benchmarks often depend on
specific background knowledge, complicating the measurement of a model's
logical reasoning capabilities. Other dynamic evaluation methods based on
strong models or manual efforts may introduce biases and incur high costs and
time demands, hindering large-scale application. To address these issues, we
propose TurtleBench. TurtleBench collects real user guesses from our online
Turtle Soup Puzzle platform that we developed. This approach allows for the
relatively dynamic generation of evaluation datasets, mitigating the risk of
model cheating while aligning assessments more closely with genuine user needs
for reasoning capabilities, thus enhancing the reliability of evaluations.
TurtleBench includes 1,532 user guesses along with annotated correctness labels.
Using this dataset, we thoroughly evaluated nine of the most
advanced LLMs available today. Notably, the OpenAI o1 series models did not
achieve leading results in these evaluations. We propose several hypotheses for
further research, such as "the latent reasoning of o1 utilizes trivial
Chain-of-Thought (CoT) techniques" and "increasing CoT length not only provides
reasoning benefits but also incurs noise costs."
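The evaluation described above reduces to scoring a model's yes/no judgment of each user guess against an annotated correctness label. The sketch below illustrates this setup; the record fields, the tiny sample, and the stub `judge` callable are illustrative assumptions, not the paper's actual data format or evaluation harness.

```python
# Hedged sketch of a TurtleBench-style accuracy computation: a judge
# (in practice an LLM prompted with the puzzle story and a user guess)
# answers "yes"/"no", and we compare against annotated labels.

def score_guesses(records, judge):
    """Return the judge's accuracy over annotated (guess, label) records.

    records: list of dicts with keys "story", "guess", and "label"
             ("yes" if the guess is correct for the story, else "no").
    judge:   callable(story, guess) -> "yes" | "no".
    """
    correct = sum(judge(r["story"], r["guess"]) == r["label"] for r in records)
    return correct / len(records)

# Tiny invented sample, not drawn from the 1,532-guess dataset.
sample = [
    {"story": "puzzle A", "guess": "He was a sailor.",    "label": "yes"},
    {"story": "puzzle A", "guess": "It happened at sea.", "label": "yes"},
    {"story": "puzzle B", "guess": "She was alone.",      "label": "no"},
]

# Stub judge that always answers "yes"; a real run would query each LLM.
always_yes = lambda story, guess: "yes"
print(score_guesses(sample, always_yes))  # 2 of 3 labels are "yes"
```

Keeping the judge as a plain callable makes it easy to swap in different LLM backends and compare their accuracy on the same annotated guesses.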