AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems
January 16, 2026
Authors: Weiyi Wang, Xinchi Chen, Jingjing Gong, Xuanjing Huang, Xipeng Qiu
cs.AI
Abstract
Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
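The abstract names a "unified agent-oriented interaction protocol" without specifying its API. As a rough, hypothetical sketch only (the names `Task`, `Schedule`, and `is_feasible` below are illustrative assumptions, not the benchmark's actual interface), such a protocol typically exposes time-windowed tasks and checks the physical feasibility of the schedule an agent proposes:

```python
from dataclasses import dataclass, field

# Hypothetical sketch of an agent-oriented scheduling interface; not the
# benchmark's actual API, which the abstract does not specify.

@dataclass
class Task:
    """A schedulable request, e.g. a ground-station contact or an Earth observation."""
    task_id: str
    window_start: float  # earliest feasible start time (seconds)
    window_end: float    # latest feasible end time (seconds)
    duration: float      # required execution time (seconds)

@dataclass
class Schedule:
    """The agent's proposed assignment of start times to task ids."""
    assignments: dict[str, float] = field(default_factory=dict)

def is_feasible(tasks: list[Task], schedule: Schedule) -> bool:
    """Check visibility windows and non-overlap on a single shared resource."""
    intervals = []
    for task in tasks:
        start = schedule.assignments.get(task.task_id)
        if start is None:
            continue  # leaving a task unscheduled is allowed; it just earns no reward
        if start < task.window_start or start + task.duration > task.window_end:
            return False  # start time falls outside the task's visibility window
        intervals.append((start, start + task.duration))
    intervals.sort()
    # adjacent intervals must not overlap (one antenna or sensor at a time)
    return all(end_a <= start_b for (_, end_a), (start_b, _) in zip(intervals, intervals[1:]))
```

Under a protocol of this shape, a generalist LLM agent must reason directly about visibility windows and resource conflicts, which is where the abstract reports large gaps relative to specialized solvers.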