HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the profound reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. At the high level, a VLM planner performs task decomposition and visual grounding to generate structured plans, each comprising a subtask instruction and a precise target bounding box. At the low level, to translate each plan into physical actions, we introduce a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. Our decoupled architecture preserves the VLM's zero-shot reasoning while allowing each component to be improved independently. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, excelling in particular at long-horizon skill composition and the fine-grained manipulation of small objects in cluttered scenes.
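
As a rough illustration of the data flow described above, the following Python sketch shows a structured plan (a subtask instruction plus a target bounding box), an object-centric crop taken from that box, and a cascade of cross-attention steps that fuses global, crop, and skill tokens in turn. It is not the authors' implementation. The names `StructuredPlan`, `crop`, `cross_attend`, and `cascaded_cross_attention` are hypothetical, the `(x_min, y_min, x_max, y_max)` pixel box format is an assumption, and the attention is a single-head toy with no learned projections and an assumed residual connection. The flow-matching DiT itself is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class StructuredPlan:
    """Hypothetical planner output: one subtask plus its grounded target box."""
    instruction: str
    bbox: tuple  # assumed format: (x_min, y_min, x_max, y_max) in pixels


def crop(image, bbox):
    """Cut the object-centric region out of an image stored as rows of pixels."""
    x0, y0, x1, y1 = bbox
    return [row[x0:x1] for row in image[y0:y1]]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, context):
    """Toy single-head cross-attention (no learned projections) with a residual."""
    dim = len(queries[0])
    out = []
    for q in queries:
        # Scaled dot-product score of this query against every context token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        # Attention-weighted average of the context tokens.
        mixed = [sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)]
        out.append([qi + mi for qi, mi in zip(q, mixed)])
    return out


def cascaded_cross_attention(action_tokens, global_tokens, crop_tokens, skill_tokens):
    """Fuse the three context streams sequentially: global, then crop, then skill."""
    x = cross_attend(action_tokens, global_tokens)
    x = cross_attend(x, crop_tokens)
    return cross_attend(x, skill_tokens)


# Usage: crop a 3x4 toy image with the box carried by the plan.
plan = StructuredPlan("pick up the red block", (1, 0, 3, 2))
image = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
patch = crop(image, plan.bbox)

# Usage: one 2-D action token attends to one token per context stream.
fused = cascaded_cross_attention([[1.0, 2.0]], [[1.0, 1.0]], [[1.0, 1.0]], [[1.0, 1.0]])
```

Because each stage only adds information to the action tokens through a residual, the stages can be reordered or swapped out independently in this sketch. Whether the actual model uses residuals, multiple heads, or a different ordering inside each DiT block is not stated in the abstract.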
