PuzzlePlex: Benchmarking Foundation Models on Reasoning and Planning with Puzzles

Abstract

This work investigates the reasoning and planning capabilities of foundation models and their scalability in complex, dynamic environments. We introduce PuzzlePlex, a benchmark designed to assess these capabilities through a diverse set of puzzles. PuzzlePlex consists of 15 types of puzzles, including deterministic and stochastic games of varying difficulty, as well as single-player and two-player scenarios. The PuzzlePlex framework provides a comprehensive environment for each game and supports extensibility, so that more challenging instances can be generated as foundation models evolve. Additionally, we implement customized game-playing strategies for comparison. Building on this benchmark, we develop fine-grained metrics to measure performance and conduct an in-depth analysis of frontier foundation models across two settings: instruction-based and code-based. Furthermore, we systematically investigate their scaling limits. Our findings show that reasoning models outperform others in instruction-based settings, while code-based execution presents greater challenges but offers a scalable and efficient alternative. PuzzlePlex enables targeted evaluation and guides future improvements in reasoning, planning, and generalization for foundation models.