PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
October 7, 2025
Authors: Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha
cs.AI
Abstract
This work investigates the reasoning and planning capabilities of foundation
models and their scalability in complex, dynamic environments. We introduce
PuzzlePlex, a benchmark designed to assess these capabilities through a diverse
set of puzzles. PuzzlePlex consists of 15 types of puzzles, including
deterministic and stochastic games of varying difficulty, as well as
single-player and two-player scenarios. The PuzzlePlex framework provides a
comprehensive environment for each game and supports extensibility to generate
more challenging instances as foundation models evolve. Additionally, we
implement customized game-playing strategies for comparison. Building on this
benchmark, we develop fine-grained metrics to measure performance and conduct
an in-depth analysis of frontier foundation models across two settings:
instruction-based and code-based. Furthermore, we systematically investigate
their scaling limits. Our findings show that reasoning models outperform others
in instruction-based settings, while code-based execution presents greater
challenges but offers a scalable and efficient alternative. PuzzlePlex enables
targeted evaluation and guides future improvements in reasoning, planning, and
generalization for foundation models.
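
To make the framework's structure concrete, here is a minimal, hypothetical sketch of what a per-game environment interface in a PuzzlePlex-style benchmark might look like. All names (`PuzzleEnv`, `StepResult`, `play_episode`, the `difficulty` parameter) are illustrative assumptions rather than the paper's actual API; the sketch only mirrors the abstract's description of per-game environments, difficulty-scalable instance generation, and pluggable game-playing strategies.

```python
# Hypothetical sketch of a PuzzlePlex-style game environment interface.
# All class and method names are illustrative assumptions; the paper's
# actual API may differ.
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class StepResult:
    state: Any    # next game state after the move is applied
    reward: float # score delta for the acting player
    done: bool    # True once the puzzle instance is solved or lost


class PuzzleEnv(ABC):
    """Common interface a benchmark environment might expose per game."""

    @abstractmethod
    def reset(self, difficulty: int = 0) -> Any:
        """Generate a fresh instance; higher difficulty yields harder instances."""

    @abstractmethod
    def legal_moves(self, state: Any) -> list[Any]:
        """Enumerate valid moves, so illegal model outputs can be rejected."""

    @abstractmethod
    def step(self, state: Any, move: Any) -> StepResult:
        """Apply a move, deterministically or stochastically per the game."""


def play_episode(env: PuzzleEnv,
                 choose_move: Callable[[Any, list[Any]], Any],
                 difficulty: int = 0) -> float:
    """Run one episode with a move-selection callable (model or baseline)."""
    state = env.reset(difficulty)
    total, done = 0.0, False
    while not done:
        move = choose_move(state, env.legal_moves(state))
        result = env.step(state, move)
        state, total, done = result.state, total + result.reward, result.done
    return total
```

Under this sketch, the instruction-based setting would pass a model-backed `choose_move` that is prompted with the current state at every turn, while the code-based setting would have the model emit a solver once and run it as `choose_move` without further model calls, which is what makes the latter the more scalable and efficient alternative.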