MomaGraph: State-Aware Unified Scene Graphs with Vision-Language Model for Embodied Task Planning

December 18, 2025
作者: Yuanchen Ju, Yongyuan Liang, Yen-Jen Wang, Nandiraju Gireesh, Yuanliang Ju, Seungjae Lee, Qiao Gu, Elvis Hsieh, Furong Huang, Koushil Sreenath
cs.AI

Abstract

Mobile manipulators in households must both navigate and manipulate. This requires a compact, semantically rich scene representation that captures where objects are, how they function, and which parts are actionable. Scene graphs are a natural choice, yet prior work often separates spatial and functional relations, treats scenes as static snapshots without object states or temporal updates, and overlooks information most relevant for accomplishing the current task. To address these limitations, we introduce MomaGraph, a unified scene representation for embodied agents that integrates spatial-functional relationships and part-level interactive elements. However, advancing such a representation requires both suitable data and rigorous evaluation, which have been largely missing. We thus contribute MomaGraph-Scenes, the first large-scale dataset of richly annotated, task-driven scene graphs in household environments, along with MomaGraph-Bench, a systematic evaluation suite spanning six reasoning capabilities from high-level planning to fine-grained scene understanding. Built upon this foundation, we further develop MomaGraph-R1, a 7B vision-language model trained with reinforcement learning on MomaGraph-Scenes. MomaGraph-R1 predicts task-oriented scene graphs and serves as a zero-shot task planner under a Graph-then-Plan framework. Extensive experiments demonstrate that our model achieves state-of-the-art results among open-source models, reaching 71.6% accuracy on the benchmark (+11.4% over the best baseline), while generalizing across public benchmarks and transferring effectively to real-robot experiments.
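The abstract does not spell out the graph schema, so the following is a minimal Python sketch of what a state-aware, unified scene graph with part-level interactive elements and a Graph-then-Plan loop could look like. All names here (ObjectNode, Relation, SceneGraph, graph_then_plan, the toy planner logic) are hypothetical illustrations for intuition, not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Hypothetical node: an object with a discrete state and a list of
# part-level actionable elements (e.g., a cabinet with a "handle").
@dataclass
class ObjectNode:
    name: str
    state: str = "default"                           # e.g., "closed", "on"
    parts: List[str] = field(default_factory=list)   # interactive parts

# Hypothetical edge: one relation type covering both spatial
# ("inside", "on_top_of") and functional ("controls") relationships,
# matching the abstract's "spatial-functional" unification.
@dataclass
class Relation:
    subject: str
    predicate: str
    obj: str

@dataclass
class SceneGraph:
    nodes: Dict[str, ObjectNode] = field(default_factory=dict)
    edges: List[Relation] = field(default_factory=list)

    def update_state(self, name: str, new_state: str) -> None:
        """Temporal update: mutate an object's state after an action."""
        self.nodes[name].state = new_state

def graph_then_plan(graph: SceneGraph, task: str) -> List[str]:
    """Schematic Graph-then-Plan: in the paper a VLM first predicts a
    task-oriented scene graph; here we assume the graph is given and
    run a toy planner over it. This planner only opens closed objects
    mentioned in the task that expose a handle part."""
    plan: List[str] = []
    for node in graph.nodes.values():
        if node.name in task and node.state == "closed" and "handle" in node.parts:
            plan.append(f"grasp {node.name}.handle")
            plan.append(f"open {node.name}")
    return plan

if __name__ == "__main__":
    g = SceneGraph()
    g.nodes["cabinet"] = ObjectNode("cabinet", state="closed",
                                    parts=["handle", "shelf"])
    g.nodes["cup"] = ObjectNode("cup")
    g.edges.append(Relation("cup", "inside", "cabinet"))
    print(graph_then_plan(g, "fetch the cup from the cabinet"))
    # -> ['grasp cabinet.handle', 'open cabinet']
```

The point of the sketch is the division of labor the abstract describes: the representation carries object states, spatial-functional relations, and actionable parts, so that planning reduces to reading the task-relevant subgraph rather than re-perceiving the whole scene.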