TTT-Bench: A Benchmark for Evaluating Reasoning Ability with Simple and Novel Tic-Tac-Toe-style Games
June 11, 2025
Authors: Prakamya Mishra, Jiang Liu, Jialian Wu, Xiaodong Yu, Zicheng Liu, Emad Barsoum
cs.AI
Abstract
Large reasoning models (LRMs) have demonstrated impressive reasoning
capabilities across a broad range of tasks including Olympiad-level
mathematical problems, indicating evidence of their complex reasoning
abilities. While many reasoning benchmarks focus on the STEM domain, the
ability of LRMs to reason correctly in broader task domains remains
underexplored. In this work, we introduce TTT-Bench, a new benchmark
that is designed to evaluate basic strategic, spatial, and logical reasoning
abilities in LRMs through a suite of four two-player Tic-Tac-Toe-style games
that humans can effortlessly solve from a young age. We propose a simple yet
scalable programmatic approach for generating verifiable two-player game
problems for TTT-Bench. Although these games are trivial for humans, they
require reasoning about the intentions of the opponent, as well as the game
board's spatial configurations, to ensure a win. We evaluate a diverse set of
state-of-the-art LRMs, and discover that the models that excel at hard
math problems frequently fail at these simple reasoning games. Further testing
reveals that our evaluated reasoning models score on average 41% and 5%
lower on TTT-Bench than on MATH 500 and AIME 2024, respectively. Larger
models achieve higher performance with shorter reasoning traces, yet most
models struggle in long-term strategic reasoning situations on the simple
and novel TTT-Bench tasks.
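The abstract describes a programmatic approach for generating verifiable two-player game problems. As a minimal sketch of that idea (the paper's actual generator and game variants may differ; the function names here are hypothetical), one can randomly play out a few legal moves of standard Tic-Tac-Toe and keep only positions where the side to move has a forced win, verified exhaustively by minimax:

```python
import random

# The eight winning lines on a 3x3 board, indexed 0..8 row-major.
LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if a line is completed, else None."""
    for a, b, c in LINES:
        if board[a] != ' ' and board[a] == board[b] == board[c]:
            return board[a]
    return None

def place(board, i, p):
    return board[:i] + p + board[i + 1:]

def minimax(board, player):
    """Game value for `player` to move: +1 forced win, 0 draw, -1 loss."""
    w = winner(board)
    other = 'O' if player == 'X' else 'X'
    if w == player:
        return 1
    if w == other:
        return -1
    moves = [i for i, c in enumerate(board) if c == ' ']
    if not moves:
        return 0
    # After `player` moves, the opponent is to move; negate their value.
    return max(-minimax(place(board, i, player), other) for i in moves)

def generate_forced_win(rng, plies=4):
    """Randomly play `plies` legal moves from the empty board and keep the
    position only if the side to move now has a provable forced win."""
    while True:
        board, player, ok = ' ' * 9, 'X', True
        for _ in range(plies):
            moves = [i for i, c in enumerate(board) if c == ' ']
            board = place(board, rng.choice(moves), player)
            if winner(board):
                ok = False
                break
            player = 'O' if player == 'X' else 'X'
        if ok and minimax(board, player) == 1:
            return board, player

board, to_move = generate_forced_win(random.Random(0))
```

Because every emitted position comes with a minimax certificate, a model's proposed move can be checked automatically (play it and re-run minimax), which is what makes such benchmark problems verifiable at scale.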