Do You Need Proprioceptive States in Visuomotor Policies?

September 23, 2025
作者: Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, Junliang Guo, Dequan Wang, Yang Gao
cs.AI

Abstract

Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where visual observations and proprioceptive states are typically adopted together for precise control. However, in this study we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. Instead, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observation, provided here by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, State-free Policies also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.
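
As a rough illustration of the idea described in the abstract (not the authors' implementation), the sketch below contrasts a state-free policy interface, which consumes only wrist-camera images, with the relative end-effector action representation it would be trained on. All class names, network shapes, and helper functions here are hypothetical and chosen only for readability.

```python
# Minimal sketch (hypothetical names): a state-free visuomotor policy takes
# only images as input and predicts a *relative* end-effector action, so
# absolute joint/pose state never enters the network.
import numpy as np
import torch
import torch.nn as nn


class StateFreePolicy(nn.Module):
    """Predicts a relative end-effector action from dual wrist-camera images only."""

    def __init__(self, action_dim: int = 7):  # e.g. 3 dpos + 3 drot + 1 gripper
        super().__init__()
        self.encoder = nn.Sequential(        # stand-in visual encoder
            nn.Conv2d(6, 32, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.head = nn.Linear(32, action_dim)

    def forward(self, left_img: torch.Tensor, right_img: torch.Tensor) -> torch.Tensor:
        # The two wide-angle wrist views are stacked along the channel axis;
        # note that no proprioceptive state vector is concatenated anywhere.
        x = torch.cat([left_img, right_img], dim=1)
        return self.head(self.encoder(x))


def relative_ee_action(pose_t: np.ndarray, pose_t1: np.ndarray) -> np.ndarray:
    """Express the next end-effector pose in the frame of the current one,
    so training targets are deltas rather than absolute poses."""
    # pose_* are 4x4 homogeneous transforms of the end-effector in the base frame.
    return np.linalg.inv(pose_t) @ pose_t1
```

Under this reading, the supervision target at each timestep is the next end-effector pose expressed in the current end-effector frame, so absolute positions (and hence proprioceptive state) are never fed to the network; this is only a sketch of the concept, not the paper's architecture.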