HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System
April 15, 2026
Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
cs.AI
Abstract
While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. To translate these plans into physical actions, we introduce at the low level a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling in long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
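The cascaded cross-attention idea described above can be sketched as follows: action tokens attend, in sequence, to the global context, the object-centric crop features, and the skill semantics, with one residual cross-attention stage per conditioning stream. This is a minimal PyTorch sketch under our own assumptions, not the paper's implementation; the class name, layer choices, and token shapes are illustrative.

```python
import torch
import torch.nn as nn


class CascadedCrossAttention(nn.Module):
    """Illustrative cascade: action tokens successively cross-attend to
    (1) global scene context, (2) high-res object-centric crop features,
    and (3) skill-semantic embeddings."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        # one cross-attention stage per conditioning stream
        self.stages = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(3)
        )
        self.norms = nn.ModuleList(nn.LayerNorm(dim) for _ in range(3))

    def forward(self, action_tokens, global_ctx, crop_feats, skill_emb):
        x = action_tokens
        for attn, norm, cond in zip(
            self.stages, self.norms, (global_ctx, crop_feats, skill_emb)
        ):
            # queries are the (normalized) action tokens; keys/values
            # come from the current conditioning stream; residual add
            out, _ = attn(query=norm(x), key=cond, value=cond)
            x = x + out
        return x


# toy shapes: batch 2, 16 noisy action tokens, feature dim 64
block = CascadedCrossAttention(dim=64)
y = block(
    torch.randn(2, 16, 64),  # noisy action tokens (DiT input)
    torch.randn(2, 50, 64),  # global image/language context tokens
    torch.randn(2, 20, 64),  # object-centric crop tokens
    torch.randn(2, 4, 64),   # skill-semantic tokens
)
print(y.shape)  # torch.Size([2, 16, 64])
```

In a flow-matching DiT, a block like this would sit inside each transformer layer so that denoising of the action sequence is grounded first in the full scene, then sharpened by the cropped target region, and finally steered by the planner's skill instruction.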