
MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

December 18, 2025
作者: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
cs.AI

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
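The abstract describes a scene graph whose nodes carry object states and part-level interactive elements, with edges for both spatial and functional relations, updated over time as a task progresses. The paper does not publish its data structures here, so the following is only a minimal hypothetical sketch of what such a state-aware graph could look like; all class and field names (`SceneObject`, `SceneGraph`, `parts`, `relations`) are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass, field

@dataclass
class SceneObject:
    """An object node with a mutable state and actionable parts."""
    name: str
    state: str = "unknown"                           # e.g. "open", "closed"
    parts: list[str] = field(default_factory=list)   # part-level elements, e.g. "door handle"

@dataclass
class SceneGraph:
    """Unified graph: spatial and functional relations share one edge list."""
    objects: dict[str, SceneObject] = field(default_factory=dict)
    relations: list[tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)

    def add_object(self, obj: SceneObject) -> None:
        self.objects[obj.name] = obj

    def relate(self, subj: str, relation: str, obj: str) -> None:
        self.relations.append((subj, relation, obj))

    def update_state(self, name: str, state: str) -> None:
        # Temporal update: the scene is not a static snapshot,
        # so object states change as the agent acts.
        self.objects[name].state = state

# Build a tiny kitchen scene and simulate one action.
g = SceneGraph()
g.add_object(SceneObject("fridge", state="closed", parts=["door handle"]))
g.add_object(SceneObject("milk", state="sealed"))
g.relate("milk", "inside", "fridge")   # spatial relation
g.relate("fridge", "stores", "milk")   # functional relation
g.update_state("fridge", "open")       # agent opened the fridge
print(g.objects["fridge"].state)       # → open
```

In a Graph-then-Plan setup, a planner would first predict a task-relevant subgraph like this one and then sequence actions over its nodes and edges; here that prediction step is simply hard-coded for illustration.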