

Do You Need Proprioceptive States in Visuomotor Policies?

September 23, 2025
Authors: Juntu Zhao, Wenbo Lu, Di Zhang, Yufeng Liu, Yushen Liang, Tianluo Zhang, Yifeng Cao, Junyuan Xie, Yingdong Hu, Shengjie Wang, Junliang Guo, Dequan Wang, Yang Gao
cs.AI

Abstract

Imitation-learning-based visuomotor policies have been widely used in robot manipulation, where visual observations and proprioceptive states are typically adopted together for precise control. However, in this study, we find that this common practice makes the policy overly reliant on the proprioceptive state input, which causes overfitting to the training trajectories and results in poor spatial generalization. Instead, we propose the State-free Policy, which removes the proprioceptive state input and predicts actions conditioned only on visual observations. The State-free Policy is built in the relative end-effector action space and requires full task-relevant visual observations, provided here by dual wide-angle wrist cameras. Empirical results demonstrate that the State-free Policy achieves significantly stronger spatial generalization than the state-based policy: in real-world tasks such as pick-and-place, challenging shirt-folding, and complex whole-body manipulation, spanning multiple robot embodiments, the average success rate improves from 0% to 85% in height generalization and from 6% to 64% in horizontal generalization. Furthermore, State-free Policies also show advantages in data efficiency and cross-embodiment adaptation, enhancing their practicality for real-world deployment.
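To make the architectural difference concrete, below is a minimal sketch (not the authors' implementation) contrasting a conventional state-based policy, which concatenates a proprioceptive state vector with visual features, against a state-free policy that conditions on visual features only, together with a simplified relative end-effector action conversion. All class names, network sizes, and the 6-channel stacking of the dual wrist-camera views are illustrative assumptions.

```python
# Minimal sketch, assuming a simple CNN encoder and MLP action head.
# It is meant only to illustrate the input difference described in the abstract.
import torch
import torch.nn as nn


class VisualEncoder(nn.Module):
    """Toy encoder for wrist-camera images (dual wide-angle views stacked on channels)."""

    def __init__(self, in_channels: int = 6, feat_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.net(images)


class StateBasedPolicy(nn.Module):
    """Common practice: action = f(visual features, proprioceptive state)."""

    def __init__(self, state_dim: int = 7, action_dim: int = 7, feat_dim: int = 256):
        super().__init__()
        self.encoder = VisualEncoder(feat_dim=feat_dim)
        self.head = nn.Sequential(
            nn.Linear(feat_dim + state_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, images: torch.Tensor, proprio: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)
        return self.head(torch.cat([feats, proprio], dim=-1))


class StateFreePolicy(nn.Module):
    """State-free variant: action = f(visual features); no proprioceptive input."""

    def __init__(self, action_dim: int = 7, feat_dim: int = 256):
        super().__init__()
        self.encoder = VisualEncoder(feat_dim=feat_dim)
        self.head = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        return self.head(self.encoder(images))


def to_relative_ee_action(ee_pose_t: torch.Tensor, ee_pose_t1: torch.Tensor) -> torch.Tensor:
    """Express the next end-effector target as a delta from the current pose
    (relative action space). Simplified here to elementwise deltas; a real
    implementation would handle rotations properly (e.g. via quaternions)."""
    return ee_pose_t1 - ee_pose_t


if __name__ == "__main__":
    imgs = torch.randn(2, 6, 128, 128)   # batch of dual-wrist-camera observations
    proprio = torch.randn(2, 7)          # joint/EE state, used by the state-based policy only
    print(StateBasedPolicy()(imgs, proprio).shape)  # torch.Size([2, 7])
    print(StateFreePolicy()(imgs).shape)            # torch.Size([2, 7])
```

The design point the sketch captures is that the state-free head never sees the robot's own configuration, so it cannot memorize absolute trajectory positions from training data; predicting relative end-effector actions from sufficiently informative wrist-camera views is what makes this omission viable.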