EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
February 4, 2026
Authors: Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang, Börje F. Karlsson
cs.AI
Abstract
Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into diverse, precise, and spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that predicts locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial-reasoning question answering, and simulated-environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1 s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
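To make the idea of a unified, structured action space concrete, the sketch below shows one hypothetical way such an output could be represented and parsed from a VLM's text prediction: a single decision per egocentric observation covering a locomotion primitive, a head movement, a manipulation command, and a human-robot interaction. All class names, fields, and the text format are illustrative assumptions, not the interface described in the paper.

```python
# Hypothetical sketch: a unified action record covering the action categories
# named in the abstract (locomotion primitives, head movements, manipulation
# commands, human-robot interaction). Names and format are illustrative only.
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class LocomotionPrimitive(Enum):
    WALK = "walk"
    TURN = "turn"
    MOVE_SIDEWAYS = "move_sideways"
    CHANGE_HEIGHT = "change_height"
    STOP = "stop"


@dataclass
class HeadMovement:
    yaw_deg: float    # look left/right relative to the torso
    pitch_deg: float  # look up/down


@dataclass
class EgoAction:
    """One structured action step decoded from a VLM text output."""
    locomotion: LocomotionPrimitive
    locomotion_arg: float            # e.g., distance in meters or angle in degrees
    head: Optional[HeadMovement]     # None if the head should stay put
    manipulation: Optional[str]      # e.g., a command handed to a low-level skill
    interaction: Optional[str]       # e.g., a spoken response to a human


def parse_vlm_output(text: str) -> EgoAction:
    """Parse a toy 'key=value;...' action string (assumed format).

    Example input: "locomotion=turn;arg=30;head=10,-5;manip=;say=Hello"
    """
    fields = dict(item.split("=", 1) for item in text.strip().split(";"))
    head = None
    if fields.get("head"):
        yaw, pitch = (float(v) for v in fields["head"].split(","))
        head = HeadMovement(yaw_deg=yaw, pitch_deg=pitch)
    return EgoAction(
        locomotion=LocomotionPrimitive(fields["locomotion"]),
        locomotion_arg=float(fields.get("arg", 0.0)),
        head=head,
        manipulation=fields.get("manip") or None,
        interaction=fields.get("say") or None,
    )
```

Under these assumptions, each decoded EgoAction would be dispatched to separate low-level controllers (gait, head, arm, speech), which is one plausible reading of how high-level planning could be bridged to concrete motor execution.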