HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Abstract

Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks and fail to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment for executing and validating agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, gather resources efficiently, master necessary skills, craft equipment, and defeat adversaries, reflecting the layered dependencies and constraints of practical scenarios. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models and including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and to reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
