VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting

Abstract

Current Vision-Language-Action (VLA) models are often constrained by a rigid, static interaction paradigm: they cannot see, hear, speak, and act concurrently, nor can they handle real-time user interruptions dynamically. This hinders seamless embodied collaboration and results in an inflexible, unresponsive user experience. To address these limitations, we introduce VITA-E, a novel embodied interaction framework designed for both behavioral concurrency and near real-time interruption. The core of our approach is a dual-model architecture in which two parallel VLA instances operate as an "Active Model" and a "Standby Model". This allows the embodied agent to observe its environment, listen to user speech, provide verbal responses, and execute actions, all concurrently and interruptibly, mimicking human-like multitasking. We further propose a "model-as-controller" paradigm, in which we fine-tune the VLM to generate special tokens that serve as direct system-level commands, coupling the model's reasoning with the system's behavior. Experiments conducted on a physical humanoid platform demonstrate that VITA-E can reliably handle complex interactive scenarios. Our framework is compatible with various dual-system VLA models, achieving an extremely high success rate on emergency stops and speech interruptions while also successfully performing concurrent speech and action. This represents a significant step towards more natural and capable embodied assistants.
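As a rough illustration of how a dual-model design and a model-as-controller scheme might fit together, the following minimal Python sketch routes hypothetical control tokens to two model slots. Everything in it is an assumption made for illustration: the token names (`<speak>`, `<act>`, `<halt>`), the `ModelSlot` and `DualModelController` classes, and the swap-on-new-action policy are not taken from the paper, and no real VLA model is loaded.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# Hypothetical control tokens; the abstract does not name the real ones.
SPEAK, ACT, HALT = "<speak>", "<act>", "<halt>"


@dataclass
class ModelSlot:
    """Stand-in for one VLA instance (illustrative only)."""
    name: str
    task: Optional[str] = None  # action instruction currently being executed


@dataclass
class DualModelController:
    """Routes control tokens between an active and a standby slot."""
    active: ModelSlot = field(default_factory=lambda: ModelSlot("model_a"))
    standby: ModelSlot = field(default_factory=lambda: ModelSlot("model_b"))
    log: List[str] = field(default_factory=list)

    def handle(self, token: str, payload: str = "") -> None:
        if token == SPEAK:
            # The standby slot answers verbally; the active task keeps running.
            self.log.append(f"{self.standby.name} says: {payload}")
        elif token == ACT:
            # Assumed policy: a new instruction preempts the current action,
            # and the standby slot takes over as the active one.
            self.active.task = None
            self.standby.task = payload
            self.active, self.standby = self.standby, self.active
            self.log.append(f"{self.active.name} starts: {payload}")
        elif token == HALT:
            # Emergency stop: drop whatever the active slot is doing.
            self.active.task = None
            self.log.append(f"{self.active.name} halted")
        else:
            raise ValueError(f"unknown control token: {token!r}")


ctrl = DualModelController()
ctrl.handle(ACT, "pick up the cup")
ctrl.handle(SPEAK, "Sure, picking it up now.")  # speech while acting
ctrl.handle(ACT, "put it down")                 # interruption with a new task
ctrl.handle(HALT)                               # emergency stop
```

In this sketch the roles are swapped on every new action instruction, so whichever slot was listening becomes the one acting. A real system would run both instances in parallel and stream tokens from the fine-tuned VLM rather than receive them as function arguments.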
