TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games

Abstract

Large reasoning models (LRMs) have demonstrated impressive reasoning capabilities across a broad range of tasks, including Olympiad-level mathematical problems, which is evidence of their complex reasoning abilities. While many reasoning benchmarks focus on the STEM domain, the ability of LRMs to reason correctly in broader task domains remains underexplored. In this work, we introduce TTT-Bench, a new benchmark designed to evaluate basic strategic, spatial, and logical reasoning abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games that humans can effortlessly solve from a young age. We propose a simple yet scalable programmatic approach for generating verifiable two-player game problems for TTT-Bench. Although these games are trivial for humans, they require reasoning about the intentions of the opponent, as well as the game board's spatial configurations, to ensure a win. We evaluate a diverse set of state-of-the-art LRMs and find that the models that excel at hard math problems frequently fail at these simple reasoning games. Further testing reveals that our evaluated reasoning models score on average 41% and 5% lower on TTT-Bench than on MATH 500 and AIME 2024, respectively. Larger models achieve higher performance with shorter reasoning traces, while most models struggle in long-term strategic reasoning situations on the simple, new TTT-Bench tasks.
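
The abstract does not spell out how the verifiable game problems are generated, so the following is only a hypothetical sketch for ordinary 3x3 Tic-Tac-Toe, not the TTT-Bench procedure. It assumes a problem is "verifiable" when the player to move (X) either has an immediate winning cell or must block exactly one winning cell of the opponent. The names `solve` and `generate_problems`, and the restriction to positions with two marks per player, are illustrative assumptions.

```python
from itertools import combinations
import random

# Cells are indexed row-major on a 3x3 board:
# 0 1 2
# 3 4 5
# 6 7 8
WIN_LINES = [
    (0, 1, 2), (3, 4, 5), (6, 7, 8),  # rows
    (0, 3, 6), (1, 4, 7), (2, 5, 8),  # columns
    (0, 4, 8), (2, 4, 6),             # diagonals
]


def has_won(cells):
    """True if the given cells contain a complete winning line."""
    held = set(cells)
    return any(all(c in held for c in line) for line in WIN_LINES)


def winning_moves(own, other):
    """Empty cells that would complete a line for the player holding `own`."""
    occupied = set(own) | set(other)
    return sorted(
        c for c in range(9)
        if c not in occupied and has_won(set(own) | {c})
    )


def solve(x_cells, o_cells):
    """Correct cells for X (to move), or None if the position is ambiguous.

    Assumed rule: win immediately if possible, otherwise block the
    opponent's single immediate threat.
    """
    wins = winning_moves(x_cells, o_cells)
    if wins:
        return wins
    blocks = winning_moves(o_cells, x_cells)
    if len(blocks) == 1:
        return blocks
    return None


def generate_problems(seed=0, limit=5):
    """Enumerate two-marks-each positions and keep those with a checkable answer."""
    problems = []
    for x_cells in combinations(range(9), 2):
        rest = [c for c in range(9) if c not in x_cells]
        for o_cells in combinations(rest, 2):
            answer = solve(x_cells, o_cells)
            if answer is not None:
                problems.append((x_cells, o_cells, answer))
    random.Random(seed).shuffle(problems)  # seeded for reproducibility
    return problems[:limit]
```

For example, `solve((0, 1), (3, 4))` returns `[2]`, because X completes the top row, and `solve((0, 8), (3, 4))` returns `[5]`, because X has no winning cell and must block O's middle row. A model's reply can then be checked by testing whether its chosen cell is in the returned list.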