Cook and Clean Together: Teaching Embodied Agents for Parallel Task Execution

November 24, 2025
Authors: Dingkang Liang, Cheng Zhang, Xiaopeng Xu, Jianzhong Ju, Zhenbo Luo, Xiang Bai
cs.AI

Abstract

Task scheduling is critical for embodied AI, enabling agents to follow natural language instructions and execute actions efficiently in 3D physical worlds. However, existing datasets often simplify task planning by ignoring operations research (OR) knowledge and 3D spatial grounding. In this work, we propose Operations Research knowledge-based 3D Grounded Task Scheduling (ORS3D), a new task that requires the synergy of language understanding, 3D grounding, and efficiency optimization. Unlike prior settings, ORS3D demands that agents minimize total completion time by leveraging parallelizable subtasks, e.g., cleaning the sink while the microwave operates. To facilitate research on ORS3D, we construct ORS3D-60K, a large-scale dataset comprising 60K composite tasks across 4K real-world scenes. Furthermore, we propose GRANT, an embodied multi-modal large language model equipped with a simple yet effective scheduling token mechanism to generate efficient task schedules and grounded actions. Extensive experiments on ORS3D-60K validate the effectiveness of GRANT across language understanding, 3D grounding, and scheduling efficiency. The code is available at https://github.com/H-EmbodVis/GRANT
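
To make the efficiency objective concrete, here is a minimal sketch of why exploiting parallelizable subtasks shortens total completion time. It is not GRANT's scheduling token mechanism or the ORS3D evaluation protocol. It is a toy model under stated assumptions: the `Subtask` class, the function names, and all durations are hypothetical; a parallelizable subtask (such as a running microwave) is assumed to need no attention once started, and starting it is assumed to take no time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subtask:
    name: str
    minutes: float
    parallelizable: bool  # assumption: True means it runs unattended once started


def sequential_time(subtasks):
    """Total time if the agent waits for each subtask before starting the next."""
    return sum(t.minutes for t in subtasks)


def parallel_aware_time(subtasks):
    """Total time if all unattended subtasks start at time 0 (zero setup cost,
    a simplifying assumption) and the agent does the attended ones back to back."""
    longest_unattended = max(
        (t.minutes for t in subtasks if t.parallelizable), default=0
    )
    attended_total = sum(t.minutes for t in subtasks if not t.parallelizable)
    return max(longest_unattended, attended_total)


# Hypothetical durations, chosen only for illustration.
tasks = [
    Subtask("heat food in the microwave", 5, True),
    Subtask("clean the sink", 3, False),
    Subtask("wipe the table", 4, False),
]

print(sequential_time(tasks))      # 5 + 3 + 4 = 12
print(parallel_aware_time(tasks))  # max(5, 3 + 4) = 7
```

In this toy instance the agent finishes in 7 minutes instead of 12 by cleaning while the microwave runs. A realistic scheduler would also have to account for setup time, travel between objects in the 3D scene, and dependencies among subtasks.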