LoHoVLA: A Unified Vision-Language-Action Model for Long-Horizon Embodied Tasks

Abstract

Real-world embodied agents face long-horizon tasks: high-level goals that demand multi-step solutions rather than single actions. Solving these tasks requires both high-level task planning (i.e., decomposing goals into sub-tasks) and low-level motion control (i.e., generating precise robot actions). While existing vision-language-action (VLA) models and hierarchical architectures show potential on embodied tasks, the former often falter in planning and the latter can suffer from coordination issues, both of which hamper performance. To overcome these limitations, we introduce LoHoVLA, a new unified VLA framework for long-horizon tasks. LoHoVLA uses a large pretrained vision-language model (VLM) as the backbone to jointly generate language tokens for sub-task generation and action tokens for robot action prediction. This shared representation promotes better generalization across tasks. In addition, LoHoVLA adopts a hierarchical closed-loop control mechanism to mitigate errors originating from both high-level planning and low-level control. To train LoHoVLA, we introduce LoHoSet, a dataset built on the Ravens simulator that contains 20 long-horizon tasks, each with 1,000 expert demonstrations composed of visual observations, linguistic goals, sub-tasks, and robot actions. Experimental results show that LoHoVLA significantly surpasses both hierarchical and standard VLA approaches on long-horizon embodied tasks in the Ravens simulator. These findings underscore the promise of unified architectures for advancing generalizable embodied intelligence.
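The abstract names a hierarchical closed-loop control mechanism but does not spell out how it works. The sketch below is one plausible reading, not the authors' implementation: an outer loop plans a sub-task from the current observation, an inner loop predicts and executes an action, and the sub-task is re-planned only after the action has failed more times than a threshold. Every name here (`ToyEnv`, `plan_subtask`, `predict_action`, `run_episode`, `max_retries`) is hypothetical, and the two "model" functions are trivial stand-ins for what would be a single shared VLM emitting language and action tokens.

```python
from dataclasses import dataclass
from typing import List, Tuple

# All names below are hypothetical illustrations, not the LoHoVLA API.


@dataclass
class ToyEnv:
    """Toy stand-in for a simulator: the goal is met once no blocks remain."""
    remaining: List[str]
    fail_next: int = 0  # how many upcoming actions silently fail (simulated noise)

    def observe(self) -> Tuple[str, ...]:
        return tuple(self.remaining)

    def step(self, action: str) -> None:
        if self.fail_next > 0:
            self.fail_next -= 1
            return  # the action had no effect on the scene
        if action in self.remaining:
            self.remaining.remove(action)

    def done(self) -> bool:
        return not self.remaining


def plan_subtask(obs: Tuple[str, ...], goal: str) -> str:
    """High-level step (stand-in for language-token output): choose a sub-task."""
    return f"place {obs[0]}"


def predict_action(obs: Tuple[str, ...], subtask: str) -> str:
    """Low-level step (stand-in for action-token output): sub-task to action."""
    return subtask.split(" ", 1)[1]


def run_episode(env: ToyEnv, goal: str, max_retries: int = 2, max_steps: int = 20):
    """Run the two-level loop; return (success, log of plan/act events)."""
    log = []
    subtask = None
    failures = 0
    for _ in range(max_steps):
        if env.done():
            return True, log
        obs = env.observe()
        # Outer loop: plan at the start, after progress, or after too many failures.
        if subtask is None or failures > max_retries:
            subtask = plan_subtask(obs, goal)
            failures = 0
            log.append(("plan", subtask))
        # Inner loop: predict and execute one action for the current sub-task.
        action = predict_action(obs, subtask)
        env.step(action)
        log.append(("act", action))
        if env.observe() == obs:
            failures += 1  # nothing changed: retry the same sub-task
        else:
            subtask, failures = None, 0  # progress: plan the next sub-task
    return env.done(), log


if __name__ == "__main__":
    env = ToyEnv(remaining=["red", "blue"], fail_next=1)
    success, events = run_episode(env, "place all blocks")
    print(success, events)
```

In this toy setting a single failed action is simply retried under the same sub-task, while a longer run of failures triggers a fresh plan from the latest observation. The threshold value and the "scene unchanged" failure test are assumptions made for illustration.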

