VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
October 21, 2025
Authors: Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid,
static interaction paradigm: they cannot see, hear, speak, and act
concurrently, nor handle real-time user interruptions dynamically. This
hinders seamless embodied collaboration and results in an inflexible,
unresponsive user experience. To address these limitations, we introduce
VITA-E, a novel embodied interaction framework designed for both behavioral
concurrency and near-real-time interruption. The core of our approach is a
dual-model architecture in which two parallel VLA instances operate as an
"Active Model" and a "Standby Model", allowing the embodied agent to observe
its environment, listen to user speech, provide verbal responses, and execute
actions, all concurrently and interruptibly, mimicking human-like multitasking
capabilities. We further propose a "model-as-controller" paradigm, in which
the VLM is fine-tuned to generate special tokens that serve as direct
system-level commands, coupling the model's reasoning with the system's
behavior. Experiments conducted on a physical humanoid platform demonstrate
that VITA-E reliably handles complex interactive scenarios. Our framework is
compatible with various dual-system VLA models, achieving extremely high
success rates on emergency stops and speech interruptions while also
performing speech and action concurrently. This represents a significant step
towards more natural and capable embodied assistants.
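The abstract's dual-model design (an Active Model executing the task while a Standby Model monitors user speech and can preempt it) can be illustrated with a minimal concurrency sketch. This is a hypothetical illustration, not the paper's implementation: the class, method names, and the use of a threading event for the interrupt signal are all assumptions.

```python
import threading
import queue

class DualModelAgent:
    """Hypothetical sketch of a dual-model scheme: one VLA instance acts as
    the Active Model executing the current task, while a Standby Model
    watches for new user input and can interrupt the Active Model."""

    def __init__(self):
        self.interrupt = threading.Event()   # raised by the Standby Model
        self.user_input = queue.Queue()      # transcribed user speech

    def active_loop(self, task: str) -> str:
        # Active Model: run action steps until the task finishes or an
        # interruption arrives from the Standby Model.
        for _ in range(100):                 # placeholder action horizon
            if self.interrupt.is_set():
                return "interrupted"
            # ... one VLA inference + actuation step would run here ...
        return "done"

    def standby_loop(self) -> str:
        # Standby Model: block until user speech arrives; on an emergency
        # stop, signal the Active Model so the two instances can swap roles.
        cmd = self.user_input.get()
        if cmd == "stop":
            self.interrupt.set()
        return cmd
```

In a real system the two loops would run in separate threads or processes; here the emergency-stop path is shown sequentially for clarity.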
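The "model-as-controller" paradigm, where the fine-tuned VLM emits special tokens that act as system-level commands, can be sketched as a small dispatcher. The token names and command mapping below are invented for illustration; the paper does not specify its token vocabulary.

```python
# Hypothetical control tokens; the actual vocabulary used by VITA-E is not
# given in the abstract.
SYSTEM_TOKENS = {
    "<|interrupt|>": "halt_current_action",
    "<|handover|>": "promote_standby_model",
    "<|speak|>": "start_tts",
}

def dispatch(vlm_output: str):
    """Scan generated text for control tokens, return the corresponding
    system commands, and strip the tokens from the verbal reply."""
    commands = []
    text = vlm_output
    for token, command in SYSTEM_TOKENS.items():
        if token in text:
            commands.append(command)
            text = text.replace(token, "")
    return commands, text.strip()
```

The point of the paradigm is that control flow is decided inside the model's generation rather than by an external rule engine; the dispatcher merely executes what the model emits.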