PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles
October 7, 2025
Authors: Yitao Long, Yuru Jiang, Hongjun Liu, Yilun Zhao, Jingchen Sun, Yiqiu Shen, Chen Zhao, Arman Cohan, Dennis Shasha
cs.AI
Abstract
This work investigates the reasoning and planning capabilities of foundation
models and their scalability in complex, dynamic environments. We introduce
PuzzlePlex, a benchmark designed to assess these capabilities through a diverse
set of puzzles. PuzzlePlex consists of 15 types of puzzles, including
deterministic and stochastic games of varying difficulty, as well as
single-player and two-player scenarios. The PuzzlePlex framework provides a
comprehensive environment for each game and is extensible, so that more
challenging instances can be generated as foundation models evolve.
Additionally, we
implement customized game-playing strategies for comparison. Building on this
benchmark, we develop fine-grained metrics to measure performance and conduct
an in-depth analysis of frontier foundation models across two settings:
instruction-based and code-based. Furthermore, we systematically investigate
their scaling limits. Our findings show that reasoning models outperform others
in instruction-based settings, while code-based execution presents greater
challenges but offers a scalable and efficient alternative. PuzzlePlex enables
targeted evaluation and guides future improvements in reasoning, planning, and
generalization for foundation models.