AdaPlanBench: Evaluating Adaptive Planning of Large Language Model Agents under World and User Constraints

Abstract

When language models plan for real-world problems, they often face both world and user constraints that are not fully specified upfront and are instead disclosed progressively through interaction. Existing benchmarks, however, underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks, with a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment through a multi-turn protocol in which a hidden constraint is revealed only when the agent proposes a plan that violates it, so the agent must revise its plan iteratively under accumulating feedback. This makes planning challenging, because agents must infer and track constraints from feedback while re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains difficult, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, that user constraints pose a particularly large challenge, and that failures often stem from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight the challenge of reliable adaptation to dynamically revealed constraints in LLM agents.
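
The multi-turn protocol described in the abstract can be illustrated with a small Python sketch. This is not the benchmark's implementation. The names `Constraint`, `run_episode`, and `toy_agent`, the turn budget, and the rule that a plan violates a constraint when it contains a forbidden step are all hypothetical simplifications chosen for illustration. The sketch also assumes an episode succeeds as soon as a plan violates no hidden constraint, and it ignores whether the plan actually achieves the task goal.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Plan = List[str]


@dataclass(frozen=True)
class Constraint:
    kind: str            # "world" or "user" (the two constraint types in the abstract)
    description: str     # feedback text shown to the agent once revealed
    forbidden_step: str  # toy violation rule (assumption): plan must not contain this step


def violated(plan: Plan, hidden: List[Constraint]) -> List[Constraint]:
    """Return the hidden constraints that the proposed plan violates."""
    return [c for c in hidden if c.forbidden_step in plan]


def run_episode(
    agent: Callable[[List[Constraint]], Plan],
    hidden: List[Constraint],
    max_turns: int = 5,  # hypothetical turn budget
) -> Tuple[bool, int, List[Constraint]]:
    """Run one episode; constraints are revealed only when a plan violates them."""
    revealed: List[Constraint] = []
    for turn in range(1, max_turns + 1):
        plan = agent(revealed)            # agent sees only the accumulated feedback
        broken = violated(plan, hidden)
        if not broken:
            return True, turn, revealed   # plan satisfies every hidden constraint
        for c in broken:                  # accumulate feedback for the next turn
            if c not in revealed:
                revealed.append(c)
    return False, max_turns, revealed


def toy_agent(revealed: List[Constraint]) -> Plan:
    """Pick the first candidate plan that avoids every revealed forbidden step."""
    candidates = [
        ["boil water in kettle", "pour into mug"],
        ["heat water in microwave", "pour into mug"],
        ["heat water on stove", "pour into mug"],
    ]
    banned = {c.forbidden_step for c in revealed}
    for plan in candidates:
        if banned.isdisjoint(plan):
            return plan
    return candidates[-1]


HIDDEN = [
    Constraint("world", "The kettle is broken.", "boil water in kettle"),
    Constraint("user", "The user does not want the microwave used.", "heat water in microwave"),
]

success, turns, revealed = run_episode(toy_agent, HIDDEN)
print(success, turns, [c.kind for c in revealed])
```

In this toy episode the agent needs one turn per hidden constraint it runs into before it finds an admissible plan, which mirrors the abstract's point that the agent must track accumulated feedback while re-planning.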