HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds
August 18, 2025
Authors: Petr Anokhin, Roman Khalikov, Stefan Rebrikov, Viktor Volkov, Artyom Sorokin, Vincent Bissonnette
cs.AI
Abstract
Large language models (LLMs) have shown remarkable capabilities in isolated
step-by-step reasoning tasks such as mathematics and programming, but their
proficiency in long-horizon planning, where solutions require extended,
structured sequences of interdependent actions, remains underexplored. Existing
benchmarks typically assess LLMs through abstract or low-dimensional
algorithmic tasks, failing to capture the complexity of realistic planning
environments. We introduce HeroBench, a novel benchmark designed specifically
to evaluate long-horizon planning and structured reasoning within complex
RPG-inspired virtual worlds. HeroBench provides a rigorously constructed
dataset of tasks covering a wide range of difficulties, a simulated environment
to execute and validate agent plans, and detailed analytical tools for
evaluating model performance. Tasks challenge models to formulate strategic
plans, efficiently gather resources, master necessary skills, craft equipment,
and defeat adversaries, reflecting the layered dependencies and constraints of
practical scenarios. Our extensive evaluation of 25 state-of-the-art LLMs, spanning
both open-source and proprietary models, including the GPT-5 family, reveals
substantial performance disparities rarely observed in conventional reasoning
benchmarks. Detailed error analysis further uncovers specific weaknesses in
current models' abilities to generate robust high-level plans and reliably
execute structured actions. HeroBench thus not only significantly advances the
evaluation of LLM reasoning but also provides a flexible, scalable foundation
for future research into advanced, autonomous planning in virtual environments.
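The abstract describes tasks built around layered dependencies (gather resources, master skills, craft equipment, defeat adversaries) and a simulated environment that executes and validates agent plans. The sketch below is a minimal, hypothetical illustration of what such precondition-based plan validation could look like; the class names, action names, and `validate_plan` function are assumptions for illustration, not the benchmark's actual API.

```python
# Hypothetical sketch (not the authors' code): replaying a plan against a task
# whose actions have prerequisite/product dependencies, in the spirit of the
# RPG-style tasks HeroBench is described as containing.
from dataclasses import dataclass, field


@dataclass
class Action:
    name: str
    requires: frozenset = frozenset()   # items/skills that must already be held
    produces: frozenset = frozenset()   # items/skills gained on success


@dataclass
class Task:
    goal: str
    actions: dict = field(default_factory=dict)


def validate_plan(task: Task, plan: list[str]) -> tuple[bool, str]:
    """Replay a plan step by step, checking every action's preconditions."""
    inventory: set[str] = set()
    for step in plan:
        action = task.actions.get(step)
        if action is None:
            return False, f"unknown action: {step}"
        missing = action.requires - inventory
        if missing:
            return False, f"{step} missing prerequisites: {sorted(missing)}"
        inventory |= action.produces
    if task.goal not in inventory:
        return False, f"goal '{task.goal}' never achieved"
    return True, "plan satisfies all dependencies"


# Toy task: mine ore and learn smithing, then craft a sword, then defeat the boss.
task = Task(
    goal="boss_defeated",
    actions={
        "mine_ore":       Action("mine_ore", produces=frozenset({"ore"})),
        "learn_smithing": Action("learn_smithing", produces=frozenset({"smithing"})),
        "craft_sword":    Action("craft_sword",
                                 requires=frozenset({"ore", "smithing"}),
                                 produces=frozenset({"sword"})),
        "fight_boss":     Action("fight_boss",
                                 requires=frozenset({"sword"}),
                                 produces=frozenset({"boss_defeated"})),
    },
)

ok, msg = validate_plan(task, ["mine_ore", "learn_smithing", "craft_sword", "fight_boss"])
print(ok, msg)  # True plan satisfies all dependencies
```

Reordering the plan (e.g., attempting `fight_boss` before `craft_sword`) fails the precondition check, which mirrors the kind of execution-level errors the abstract attributes to current models alongside weaknesses in high-level planning.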