AstroReason-Bench: Evaluating Unified Agentic Planning across Heterogeneous Space Planning Problems

Abstract

Recent advances in agentic Large Language Models (LLMs) have positioned them as generalist planners capable of reasoning and acting across diverse tasks. However, existing agent benchmarks largely focus on symbolic or weakly grounded environments, leaving agent performance in physics-constrained real-world domains underexplored. We introduce AstroReason-Bench, a comprehensive benchmark for evaluating agentic planning in Space Planning Problems (SPP), a family of high-stakes problems with heterogeneous objectives, strict physical constraints, and long-horizon decision-making. AstroReason-Bench integrates multiple scheduling regimes, including ground station communication and agile Earth observation, and provides a unified agent-oriented interaction protocol. Evaluating a range of state-of-the-art open- and closed-source agentic LLM systems, we find that current agents substantially underperform specialized solvers, highlighting key limitations of generalist planning under realistic constraints. AstroReason-Bench offers a challenging and diagnostic testbed for future agentic research.
