HiVLA: A Visual-Grounded-Centric Hierarchical Embodied Manipulation System

Abstract

While end-to-end Vision-Language-Action (VLA) models offer a promising paradigm for robotic manipulation, fine-tuning them on narrow control data often compromises the advanced reasoning capabilities inherited from their base Vision-Language Models (VLMs). To resolve this fundamental trade-off, we propose HiVLA, a visual-grounded-centric hierarchical framework that explicitly decouples high-level semantic planning from low-level motor control. In the high-level part, a VLM planner first performs task decomposition and visual grounding to generate a structured plan comprising a subtask instruction and a precise target bounding box. Then, to translate this plan into physical actions, the low-level part introduces a flow-matching Diffusion Transformer (DiT) action expert equipped with a novel cascaded cross-attention mechanism. This design sequentially fuses global context, high-resolution object-centric crops, and skill semantics, enabling the DiT to focus purely on robust execution. The decoupled architecture preserves the VLM's zero-shot reasoning while allowing both components to be improved independently. Extensive experiments in simulation and the real world demonstrate that HiVLA significantly outperforms state-of-the-art end-to-end baselines, particularly in long-horizon skill composition and in the fine-grained manipulation of small objects in cluttered scenes.
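To make the data flow described above more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a hypothetical `SubtaskPlan` container, a pixel-coordinate box layout of (x_min, y_min, x_max, y_max), and a simplified single-head cross-attention without learned projections; token shapes and the example instruction are invented for illustration. It shows how a plan's bounding box could yield an object-centric crop, and how action tokens could attend to global, crop, and skill tokens one after another.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np


@dataclass
class SubtaskPlan:
    """Hypothetical container for one step of the planner's structured plan."""
    instruction: str                  # subtask instruction (skill semantics)
    bbox: Tuple[int, int, int, int]   # assumed layout: (x_min, y_min, x_max, y_max) in pixels


def crop_target(image: np.ndarray, bbox: Tuple[int, int, int, int]) -> np.ndarray:
    """Cut the object-centric crop out of an H x W x C image, clamping the box to the image."""
    h, w = image.shape[:2]
    x0, y0, x1, y1 = bbox
    x0, x1 = max(0, x0), min(w, x1)
    y0, y1 = max(0, y0), min(h, y1)
    return image[y0:y1, x0:x1]


def cross_attention(queries: np.ndarray, context: np.ndarray) -> np.ndarray:
    """Single-head scaled dot-product cross-attention with a residual connection.

    Simplification: no learned query/key/value projections are used.
    """
    d = queries.shape[-1]
    scores = queries @ context.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return queries + weights @ context


def cascaded_fusion(action_tokens, global_tokens, crop_tokens, skill_tokens):
    """Attend to global context, then the object crop, then skill semantics, in that order."""
    x = action_tokens
    for context in (global_tokens, crop_tokens, skill_tokens):
        x = cross_attention(x, context)
    return x


# Usage with made-up inputs (the instruction and all shapes are illustrative only).
plan = SubtaskPlan("pick up the red block", (40, 30, 104, 94))
image = np.zeros((240, 320, 3), dtype=np.uint8)
crop = crop_target(image, plan.bbox)

rng = np.random.default_rng(0)
fused = cascaded_fusion(
    rng.standard_normal((8, 16)),    # action tokens
    rng.standard_normal((20, 16)),   # global-context tokens
    rng.standard_normal((12, 16)),   # object-centric crop tokens
    rng.standard_normal((5, 16)),    # skill-semantics tokens
)
```

In this sketch the cascade order mirrors the order named in the abstract (global context, object-centric crop, skill semantics); a real action expert would use learned multi-head attention inside each DiT block and be trained with a flow-matching objective, neither of which is modeled here.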

