VITA-E: Natural Embodied Interaction with Concurrent Seeing, Hearing, Speaking, and Acting
October 21, 2025
Authors: Xiaoyu Liu, Chaoyou Fu, Chi Yan, Chu Wu, Haihan Gao, Yi-Fan Zhang, Shaoqi Dong, Cheng Qian, Bin Luo, Xiuyong Yang, Guanwu Li, Yusheng Cai, Yunhang Shen, Deqiang Jiang, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He
cs.AI
Abstract
Current Vision-Language-Action (VLA) models are often constrained by a rigid,
static interaction paradigm, which lacks the ability to see, hear, speak, and
act concurrently and to handle real-time user interruptions dynamically.
This hinders seamless embodied collaboration, resulting in an inflexible and
unresponsive user experience. To address these limitations, we introduce
VITA-E, a novel embodied interaction framework designed for both behavioral
concurrency and near-real-time interruption handling. The core of our approach
is a dual-model architecture in which two parallel VLA instances operate as an
"Active Model" and a "Standby Model", allowing the embodied agent to observe
its environment, listen to user speech, provide verbal responses, and execute
actions, all concurrently and interruptibly, mimicking human-like multitasking
capabilities. We further propose a "model-as-controller" paradigm, in which we
fine-tune the VLM to generate special tokens that serve as direct system-level
commands, coupling the model's reasoning with the system's behavior.
Experiments conducted on a physical humanoid platform demonstrate that VITA-E
can reliably handle complex interactive scenarios. Our framework is compatible
with various dual-system VLA models, achieving extremely high success rates
on emergency stops and speech interruptions while also successfully performing
concurrent speech and action. This represents a significant step towards more
natural and capable embodied assistants.
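
To make the dual-model architecture and the model-as-controller paradigm concrete, the following minimal Python sketch shows how two VLA workers might swap Active/Standby roles on a user interruption, and how special tokens emitted by the fine-tuned VLM could be dispatched as system-level commands. The token names, class names, and methods here are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of VITA-E's dual-model, model-as-controller loop.
# All token names, classes, and methods are illustrative assumptions.
import threading

# Special tokens the fine-tuned VLM is assumed to emit as commands.
STOP_TOKEN = "<|stop|>"    # emergency stop: halt the current action
SPEAK_TOKEN = "<|speak|>"  # a verbal response follows the token
ACT_TOKEN = "<|act|>"      # an action chunk follows the token


class VLAWorker:
    """One VLA instance; two run in parallel as Active/Standby."""

    def __init__(self, name: str):
        self.name = name
        self.interrupted = threading.Event()


class DualModelController:
    def __init__(self):
        self.active = VLAWorker("model-A")
        self.standby = VLAWorker("model-B")

    def on_user_interrupt(self) -> None:
        """Cancel the active model's task; the standby model takes over."""
        self.active.interrupted.set()
        self.active, self.standby = self.standby, self.active
        self.active.interrupted.clear()

    def dispatch(self, model_output: str) -> None:
        """'Model as controller': map special tokens to system commands."""
        if model_output.startswith(STOP_TOKEN):
            print("[system] halting actuators")
        elif model_output.startswith(SPEAK_TOKEN):
            print("[tts]", model_output[len(SPEAK_TOKEN):].strip())
        elif model_output.startswith(ACT_TOKEN):
            print("[robot] executing:", model_output[len(ACT_TOKEN):].strip())


if __name__ == "__main__":
    ctrl = DualModelController()
    ctrl.dispatch("<|speak|> Picking up the cup now.")
    ctrl.on_user_interrupt()   # user speaks over the robot; roles swap
    ctrl.dispatch("<|stop|>")  # an emitted token becomes a system command
```

In this sketch the role swap is what enables interruptibility: while the formerly active worker winds down its cancelled task, the fresh active worker is immediately free to perceive, listen, and respond.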