Towards Generalizable Robotic Manipulation in Dynamic Environments

March 16, 2026
Authors: Heng Fang, Shangru Li, Shuhan Wang, Xuanyang Xi, Dingkang Liang, Xiang Bai
cs.AI

Abstract

Vision-Language-Action (VLA) models excel in static manipulation but struggle in dynamic environments with moving targets. This performance gap primarily stems from a scarcity of dynamic manipulation datasets and the reliance of mainstream VLAs on single-frame observations, restricting their spatiotemporal reasoning capabilities. To address this, we introduce DOMINO, a large-scale dataset and benchmark for generalizable dynamic manipulation, featuring 35 tasks with hierarchical complexities, over 110K expert trajectories, and a multi-dimensional evaluation suite. Through comprehensive experiments, we systematically evaluate existing VLAs on dynamic tasks, explore effective training strategies for dynamic awareness, and validate the generalizability of dynamic data. Furthermore, we propose PUMA, a dynamics-aware VLA architecture. By integrating scene-centric historical optical flow and specialized world queries to implicitly forecast object-centric future states, PUMA couples history-aware perception with short-horizon prediction. Results demonstrate that PUMA achieves state-of-the-art performance, yielding a 6.3% absolute improvement in success rate over baselines. Moreover, we show that training on dynamic data fosters robust spatiotemporal representations that transfer to static tasks. All code and data are available at https://github.com/H-EmbodVis/DOMINO.
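The abstract only sketches PUMA at a high level. Below is a minimal, hypothetical illustration (not the authors' released code) of how a dynamics-aware fusion module of this kind could be structured: a short history of scene-centric optical flow is encoded into motion tokens, and a set of learnable "world query" tokens attends over the fused visual and motion context to implicitly carry object-centric future-state information for the action head. The module name, encoder layout, number of queries, and the 7-D future-state head are all illustrative assumptions.

```python
# Hypothetical sketch of a PUMA-style dynamics-aware fusion module.
# All names, shapes, and hyperparameters are illustrative, not from the paper's code.
import torch
import torch.nn as nn


class DynamicsAwareFusion(nn.Module):
    def __init__(self, d_model=512, n_queries=8, n_heads=8, flow_channels=2):
        super().__init__()
        # Encode a short history of optical-flow frames (scene-centric motion cues).
        self.flow_encoder = nn.Sequential(
            nn.Conv2d(flow_channels, 64, 7, stride=4, padding=3), nn.GELU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.AdaptiveAvgPool2d(1),
        )
        # Learnable world queries that attend to the fused context and are trained
        # to implicitly represent near-future, object-centric states.
        self.world_queries = nn.Parameter(torch.randn(n_queries, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Auxiliary head: predict an object-centric future state (e.g., a target pose).
        self.future_head = nn.Linear(d_model, 7)

    def forward(self, vis_tokens, flow_history):
        # vis_tokens:   (B, N, d_model) current-frame visual tokens from the VLA backbone
        # flow_history: (B, T, 2, H, W) optical flow over the last T frames
        b, t = flow_history.shape[:2]
        flow_feat = self.flow_encoder(flow_history.flatten(0, 1))    # (B*T, d_model, 1, 1)
        flow_tokens = flow_feat.flatten(1).view(b, t, -1)            # (B, T, d_model)
        context = torch.cat([vis_tokens, flow_tokens], dim=1)        # history-aware context
        queries = self.world_queries.unsqueeze(0).expand(b, -1, -1)
        world_state, _ = self.cross_attn(queries, context, context)  # (B, n_queries, d_model)
        future_pred = self.future_head(world_state.mean(dim=1))      # auxiliary prediction target
        # world_state would be concatenated with vis_tokens downstream to condition the action head.
        return world_state, future_pred
```

In this sketch, the auxiliary future-state prediction is what couples history-aware perception with short-horizon forecasting; how PUMA actually supervises its world queries is specified in the paper and code linked above.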