HeroBench: A Benchmark for Long-Horizon Planning and Structured Reasoning in Virtual Worlds

Abstract

Large language models (LLMs) have shown remarkable capabilities in isolated step-by-step reasoning tasks such as mathematics and programming, but their proficiency in long-horizon planning, where solutions require extended, structured sequences of interdependent actions, remains underexplored. Existing benchmarks typically assess LLMs through abstract or low-dimensional algorithmic tasks, failing to capture the complexity of realistic planning environments. We introduce HeroBench, a novel benchmark designed specifically to evaluate long-horizon planning and structured reasoning within complex RPG-inspired virtual worlds. HeroBench provides a rigorously constructed dataset of tasks covering a wide range of difficulties, a simulated environment to execute and validate agent plans, and detailed analytical tools for evaluating model performance. Tasks challenge models to formulate strategic plans, efficiently gather resources, master necessary skills, craft equipment, and defeat adversaries, reflecting practical scenarios' layered dependencies and constraints. Our extensive evaluation of 25 state-of-the-art LLMs, spanning both open-source and proprietary models, including the GPT-5 family, reveals substantial performance disparities rarely observed in conventional reasoning benchmarks. Detailed error analysis further uncovers specific weaknesses in current models' abilities to generate robust high-level plans and reliably execute structured actions. HeroBench thus not only significantly advances the evaluation of LLM reasoning but also provides a flexible, scalable foundation for future research into advanced, autonomous planning in virtual environments.
