EgoActor: Grounding Task Planning into Spatial-aware Egocentric Actions for Humanoid Robots via Visual-Language Models
February 4, 2026
Authors: Yu Bai, MingMing Yu, Chaojie Li, Ziyi Bai, Xinlong Wang, Börje F. Karlsson
cs.AI
Abstract
Deploying humanoid robots in real-world settings is fundamentally challenging, as it demands tight integration of perception, locomotion, and manipulation under partial-information observations and dynamically changing environments, as well as robust transitions between sub-tasks of different types. To address these challenges, we propose a novel task, EgoActing, which requires directly grounding high-level instructions into diverse, precise, spatially aware humanoid actions. We further instantiate this task by introducing EgoActor, a unified and scalable vision-language model (VLM) that predicts locomotion primitives (e.g., walk, turn, move sideways, change height), head movements, manipulation commands, and human-robot interactions to coordinate perception and execution in real time. We leverage broad supervision over egocentric RGB-only data from real-world demonstrations, spatial-reasoning question answering, and simulated-environment demonstrations, enabling EgoActor to make robust, context-aware decisions and perform fluent action inference (under 1 s) with both 8B and 4B parameter models. Extensive evaluations in both simulated and real-world environments demonstrate that EgoActor effectively bridges abstract task planning and concrete motor execution, while generalizing across diverse tasks and unseen environments.
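To make the described action space concrete, the sketch below shows a minimal, hypothetical structured-action schema and closed-loop decision cycle consistent with what the abstract outlines (egocentric RGB input, locomotion primitives, head movements, manipulation commands, and human-robot interaction). This is not the authors' implementation; all names (EgoAction, predict_action, capture_egocentric_rgb, dispatch) are assumptions introduced purely for illustration.

```python
# Illustrative sketch only: a hypothetical action schema and control loop for an
# EgoActor-style policy. The interfaces below are assumptions, not the paper's API.

from dataclasses import dataclass
from typing import Optional


@dataclass
class EgoAction:
    """One structured action step emitted by the VLM policy."""
    locomotion: Optional[str] = None    # e.g. "walk_forward", "turn_left", "sidestep_right", "lower_height"
    head: Optional[str] = None          # e.g. "look_left", "look_down"
    manipulation: Optional[str] = None  # e.g. "grasp(cup)", "place(table)"
    interaction: Optional[str] = None   # e.g. "speak('Handing you the cup')"
    done: bool = False                  # task-completion flag


def control_loop(vlm, robot, instruction: str, max_steps: int = 200) -> None:
    """Closed-loop execution: egocentric RGB frame + instruction -> next structured action."""
    for _ in range(max_steps):
        frame = robot.capture_egocentric_rgb()                       # RGB-only observation
        action: EgoAction = vlm.predict_action(frame, instruction)   # sub-second inference assumed
        if action.done:
            break
        robot.dispatch(action)  # hand primitives to low-level locomotion/manipulation controllers
```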