
HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

April 15, 2026
Authors: Tianshuo Yang, Guanyu Chen, Yutian Chen, Zhixuan Liang, Yitian Liu, Zanxin Chen, Chunpu Xu, Haotian Liang, Jiangmiao Pang, Yao Mu, Ping Luo
cs.AI

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner first performs task decomposition and visual grounding to generate a structured plan comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, we introduce at the low level a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing independent improvement of both components. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly excelling at long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
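To make the cascaded conditioning concrete, below is a minimal PyTorch sketch of one action-expert block in the spirit of the abstract: a DiT-style block whose cross-attention attends, in sequence, to global scene tokens, object-centric crop tokens (derived from the planner's bounding box), and skill-semantic tokens. All class names, dimensions, and the exact residual layout are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a cascaded cross-attention DiT block, assuming a
# three-stream conditioning order of global scene -> object crop -> skill.
# Names and shapes are hypothetical; this is not the paper's released code.
import torch
import torch.nn as nn


class CascadedCrossAttentionBlock(nn.Module):
    """One action-expert block: self-attention over noisy action tokens,
    then sequential cross-attention to three condition streams."""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # One cross-attention layer per condition stream, applied in cascade.
        self.cross_global = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_crop = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_skill = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(5)])
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x, global_ctx, crop_ctx, skill_ctx):
        # x: noisy action tokens (B, T, dim); each ctx: (B, N_i, dim).
        h = self.norms[0](x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        # Cascade: global scene context, then high-res object crop,
        # then skill semantics, each fused via a residual cross-attention.
        for norm, attn, ctx in (
            (self.norms[1], self.cross_global, global_ctx),
            (self.norms[2], self.cross_crop, crop_ctx),
            (self.norms[3], self.cross_skill, skill_ctx),
        ):
            x = x + attn(norm(x), ctx, ctx, need_weights=False)[0]
        return x + self.mlp(self.norms[4](x))


# Example shapes: 16 action tokens, 196 scene tokens, 64 crop tokens, 8 skill tokens.
block = CascadedCrossAttentionBlock()
actions = torch.randn(2, 16, 512)
out = block(
    actions,
    torch.randn(2, 196, 512),
    torch.randn(2, 64, 512),
    torch.randn(2, 8, 512),
)
```

Under these assumptions, the coarse-to-fine ordering mirrors the fusion order described in the abstract, and a flow-matching objective on the denoised action tokens would be trained on top of a stack of such blocks.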