Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution
November 24, 2025
Authors: Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai
cs.AI
Abstract
Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
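To make the efficiency objective concrete, the following is a minimal sketch of how overlapping a hands-off subtask with hands-on work shortens total completion time. It is not GRANT's scheduling mechanism or the ORS3D-60K data format. The `Subtask` class, the longest-wait-first heuristic, every duration, and the "wipe table" subtask are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str
    active: int    # minutes the agent must be hands-on (hypothetical values)
    wait: int = 0  # minutes the subtask then runs unattended


def sequential_time(tasks):
    """Total time if the agent idles through every unattended wait."""
    return sum(t.active + t.wait for t in tasks)


def parallel_time(tasks):
    """Total time under a simple heuristic: start the subtasks with the
    longest unattended wait first, then do hands-on work while they run.
    This is an illustrative heuristic, not the paper's method."""
    ordered = sorted(tasks, key=lambda t: t.wait, reverse=True)
    clock = 0   # time at which the agent finishes its current hands-on step
    finish = 0  # latest completion time over all subtasks so far
    for t in ordered:
        clock += t.active
        finish = max(finish, clock + t.wait)
    return finish


tasks = [
    Subtask("heat food in microwave", active=1, wait=5),
    Subtask("clean sink", active=4),
    Subtask("wipe table", active=3),
]

print(sequential_time(tasks))  # 13
print(parallel_time(tasks))    # 8
```

With these assumed durations, the agent cleans the sink during the microwave's five unattended minutes, so the schedule takes 8 minutes instead of 13.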