AdaPlanBench: Evaluating Adaptive Planning under World and User Constraints in Large Language Model Agents

Abstract

When language models plan for real-world problems, they often face both world constraints and user constraints that are not fully specified upfront and are instead disclosed progressively through interaction. Existing benchmarks, however, still underexplore adaptive planning under such progressively revealed dual constraints. To address this gap, we introduce AdaPlanBench, a dynamic interactive benchmark for evaluating whether Large Language Model (LLM) agents can adaptively plan and re-plan under progressively revealed world and user constraints. AdaPlanBench is built on 307 household tasks and uses a scalable constraint construction pipeline that augments each task with dual constraints. At runtime, agents interact with the environment in a multi-turn protocol: a hidden constraint is revealed only when the agent proposes a plan that violates it, so the agent must revise its plan iteratively under accumulating feedback. This makes planning difficult, because agents must infer and track constraints from feedback while still re-planning effectively. Experiments on ten leading LLMs show that adaptive planning under dual constraints remains challenging, with the best model reaching only 67.75% accuracy. We further observe that performance degrades as more constraints accumulate, that user constraints pose a particularly large challenge, and that failures often stem from weaker physical grounding and reduced effectiveness. These results establish AdaPlanBench as a testbed for dual-constrained interactive planning and highlight how difficult reliable adaptation to dynamically revealed constraints remains for LLM agents.
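
The multi-turn protocol described in the abstract can be illustrated with a minimal sketch. Everything named here (`Constraint`, `Environment`, `run_episode`, `toy_agent`, the boil-water task, and the turn limit) is a hypothetical stand-in chosen for illustration, not the benchmark's actual interface; in particular, a rule-based agent replaces the LLM, and the sketch only assumes what the abstract states: a hidden constraint is disclosed when a proposed plan violates it, and feedback accumulates across turns.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

Plan = List[str]  # a plan is an ordered list of natural-language steps


@dataclass
class Constraint:
    kind: str                            # "world" or "user"
    description: str                     # feedback text shown once revealed
    violated_by: Callable[[Plan], bool]  # True if the plan breaks the constraint


@dataclass
class Environment:
    constraints: List[Constraint]        # all hidden at the start of an episode
    revealed: List[str] = field(default_factory=list)

    def check(self, plan: Plan) -> Optional[Constraint]:
        """Return the first violated constraint and reveal it; None if the plan is valid."""
        for c in self.constraints:
            if c.violated_by(plan):
                msg = f"{c.kind}: {c.description}"
                if msg not in self.revealed:
                    self.revealed.append(msg)
                return c
        return None


def run_episode(env: Environment,
                agent: Callable[[List[str]], Plan],
                max_turns: int = 5) -> Tuple[bool, int]:
    """Let the agent propose plans until one violates nothing or turns run out."""
    for turn in range(1, max_turns + 1):
        plan = agent(list(env.revealed))  # agent sees all feedback so far
        if env.check(plan) is None:
            return True, turn
    return False, max_turns


def toy_agent(feedback: List[str]) -> Plan:
    """Rule-based stand-in for an LLM: avoid any tool mentioned in the feedback."""
    for tool in ["microwave", "gas stove", "electric kettle"]:
        if not any(tool in msg for msg in feedback):
            return [f"boil water with the {tool}"]
    return ["boil water with the electric kettle"]  # fallback when every tool is ruled out


env = Environment([
    Constraint("world", "the microwave is broken",
               lambda p: any("microwave" in s for s in p)),
    Constraint("user", "the user does not want the gas stove used",
               lambda p: any("gas stove" in s for s in p)),
])
success, turns = run_episode(env, toy_agent)
print(success, turns, env.revealed)
```

In this toy episode the agent needs three turns: its first two plans each trigger one hidden constraint (first the world constraint, then the user constraint), and the third plan satisfies both. A real agent would have to infer from the free-text feedback which alternatives remain admissible, which is the step the abstract identifies as difficult.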