DexGraspVLA: A Vision-Language-Action Framework Towards General Dexterous Grasping
February 28, 2025
Authors: Yifan Zhong, Xuchuan Huang, Ruochong Li, Ceyao Zhang, Yitao Liang, Yaodong Yang, Yuanpei Chen
cs.AI
Abstract
Dexterous grasping remains a fundamental yet challenging problem in robotics.
A general-purpose robot must be capable of grasping diverse objects in
arbitrary scenarios. However, existing research typically relies on specific
assumptions, such as single-object settings or limited environments, leading to
constrained generalization. Our solution is DexGraspVLA, a hierarchical
framework that utilizes a pre-trained Vision-Language model as the high-level
task planner and learns a diffusion-based policy as the low-level Action
controller. The key insight lies in iteratively transforming diverse language
and visual inputs into domain-invariant representations, where imitation
learning can be effectively applied due to the alleviation of domain shift.
Thus, it enables robust generalization across a wide range of real-world
scenarios. Notably, our method achieves a 90+% success rate under thousands of
unseen object, lighting, and background combinations in a "zero-shot"
environment. Empirical analysis further confirms the consistency of internal
model behavior across environmental variations, thereby validating our design
and explaining its generalization performance. We hope our work can be a step
forward in achieving general dexterous grasping. Our demo and code can be found
at https://dexgraspvla.github.io/.
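The hierarchical design described above (a pre-trained vision-language model as high-level task planner, a diffusion-based policy as low-level action controller) can be sketched as a simple control loop. This is a minimal illustration only: every class and method name below is hypothetical, and the dummy bodies stand in for the real VLM grounding and diffusion denoising steps, which are not reproduced here.

```python
# Hedged sketch of the hierarchical planner/controller loop from the abstract.
# All names are illustrative placeholders, not the authors' actual API.

class HighLevelPlanner:
    """Stands in for the pre-trained vision-language model (task planner)."""
    def plan(self, instruction: str, observation) -> dict:
        # The real planner grounds the language instruction in the scene;
        # here we just return a placeholder subgoal specification.
        return {"target": instruction, "context": observation}

class LowLevelController:
    """Stands in for the diffusion-based action controller."""
    def act(self, subgoal: dict, observation) -> list:
        # The real controller denoises an action sequence conditioned on
        # domain-invariant features; here we emit a dummy 7-DoF command.
        return [0.0] * 7

def grasp_loop(instruction: str, observations: list) -> list:
    planner, controller = HighLevelPlanner(), LowLevelController()
    actions = []
    for obs in observations:
        subgoal = planner.plan(instruction, obs)      # high level: language + vision -> subgoal
        actions.append(controller.act(subgoal, obs))  # low level: subgoal -> motor commands
    return actions

acts = grasp_loop("grasp the red mug", ["frame0", "frame1"])
```

The key idea the sketch mirrors is the division of labor: the planner handles diverse language and visual inputs once per step, so the controller only ever sees a narrower, more domain-invariant conditioning signal.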