ExoActor: Exocentric Video Generation as Generalizable Interactive Humanoid Control

Abstract

Humanoid control systems have made significant progress in recent years, yet modeling fluent, interaction-rich behavior among a robot, its surrounding environment, and task-relevant objects remains a fundamental challenge. The difficulty arises from the need to jointly capture spatial context, temporal dynamics, robot actions, and task intent at scale, which conventional supervised learning is poorly suited to. We propose ExoActor, a novel framework that addresses this problem by leveraging the generalization capabilities of large-scale video generation models. The key insight of ExoActor is to use third-person video generation as a unified interface for modeling interaction dynamics. Given a task instruction and scene context, ExoActor synthesizes plausible execution processes that implicitly encode coordinated interactions among the robot, environment, and objects. The generated video is then converted into executable humanoid behavior by a pipeline that estimates human motion from it and executes that motion through a general motion controller, yielding a task-conditioned behavior sequence. To validate the framework, we implement it as an end-to-end system and demonstrate generalization to new scenarios without additional real-world data collection. Finally, we discuss the limitations of the current implementation and outline promising directions for future research, illustrating how ExoActor offers a scalable approach to modeling interaction-rich humanoid behaviors and may open a new avenue for generative models to advance general-purpose humanoid intelligence.
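The three stages named in the abstract (video generation, human motion estimation, motion control) can be pictured as a simple chain of functions. The following is only a minimal sketch of that data flow, not the authors' implementation: every name in it (`Task`, `ExoPipeline`, `generate_video`, `estimate_motion`, `track_motion`, and the toy stand-in functions) is hypothetical, and the stages are replaced by trivial placeholders so that the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# NOTE: all names here are hypothetical placeholders chosen for illustration;
# none of them come from the paper or from any real library.

Frame = str            # stand-in for one generated third-person video frame
Pose = List[float]     # stand-in for one estimated human pose
Command = List[float]  # stand-in for one low-level humanoid command


@dataclass
class Task:
    instruction: str    # natural-language task instruction
    scene_context: str  # stand-in for the scene observation


@dataclass
class ExoPipeline:
    # Each stage is injected as a callable so it can be swapped independently.
    generate_video: Callable[[Task], Sequence[Frame]]
    estimate_motion: Callable[[Sequence[Frame]], Sequence[Pose]]
    track_motion: Callable[[Sequence[Pose]], Sequence[Command]]

    def run(self, task: Task) -> List[Command]:
        frames = self.generate_video(task)    # task + scene -> video
        poses = self.estimate_motion(frames)  # video -> human motion
        return list(self.track_motion(poses)) # motion -> robot commands


# Toy stand-ins so the sketch is runnable; real stages would be learned models.
def toy_video(task: Task) -> List[Frame]:
    return [f"{task.instruction}#{i}" for i in range(3)]


def toy_motion(frames: Sequence[Frame]) -> List[Pose]:
    return [[float(i)] * 2 for i, _ in enumerate(frames)]


def toy_controller(poses: Sequence[Pose]) -> List[Command]:
    return [[0.5 * v for v in pose] for pose in poses]


pipeline = ExoPipeline(toy_video, toy_motion, toy_controller)
commands = pipeline.run(Task("pick up the box", "kitchen"))
```

Passing the stages in as callables reflects the abstract's description of video generation as an interface between task specification and control: under this assumption, any one stage could be replaced without touching the others.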
