HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
August 18, 2025
作者: Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette
cs.AI
Abstract
Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting the layered dependencies and constraints of practical scenarios. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
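To make the layered dependencies described above concrete, the sketch below models a HeroBench-style task as actions with prerequisites (gather resources, learn a skill, craft equipment, defeat an adversary) and checks a candidate plan by simulating it step by step. This is a minimal illustration under stated assumptions: the Action class, the action names, and the validate_plan helper are hypothetical and are not the benchmark's actual API.

```python
# Hypothetical sketch of a HeroBench-style task: a plan is a sequence of
# actions whose prerequisites form a layered dependency chain
# (gather resources -> learn skill -> craft equipment -> defeat adversary).
# All names here are illustrative assumptions, not the benchmark's interface.
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str                           # e.g. "mine_iron_ore"
    requires: frozenset = frozenset()   # facts that must already hold
    provides: frozenset = frozenset()   # facts made true by this action


def validate_plan(plan, goal, initial=frozenset()):
    """Simulate the plan step by step; return (success, message)."""
    state = set(initial)
    for step, action in enumerate(plan):
        missing = action.requires - state
        if missing:
            return False, f"step {step} ({action.name}): missing {sorted(missing)}"
        state |= action.provides
    unmet = goal - state
    if unmet:
        return False, f"goal not reached: missing {sorted(unmet)}"
    return True, "ok"


# Toy task: craft a sword from ore and smithing, then defeat a bandit.
plan = [
    Action("mine_iron_ore", provides=frozenset({"iron_ore"})),
    Action("learn_smithing", provides=frozenset({"smithing"})),
    Action("craft_sword",
           requires=frozenset({"iron_ore", "smithing"}),
           provides=frozenset({"iron_sword"})),
    Action("defeat_bandit",
           requires=frozenset({"iron_sword"}),
           provides=frozenset({"bandit_defeated"})),
]
print(validate_plan(plan, goal=frozenset({"bandit_defeated"})))  # (True, 'ok')
```

Reordering or omitting any step makes the check fail with a message naming the unmet prerequisite, which mirrors the kind of error analysis the benchmark reports for high-level planning versus structured execution failures.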