UniSkill: Imitating Human Videos via Cross-Embodiment Skill Representations

Abstract

Mimicry is a fundamental learning mechanism in humans, enabling individuals to learn new tasks by observing and imitating experts. However, applying this ability to robots presents significant challenges due to the inherent differences between human and robot embodiments in both visual appearance and physical capabilities. While previous methods bridge this gap using cross-embodiment datasets with shared scenes and tasks, collecting such aligned human-robot data at scale is not trivial. In this paper, we propose UniSkill, a novel framework that learns embodiment-agnostic skill representations from large-scale cross-embodiment video data without any labels, enabling skills extracted from human video prompts to transfer effectively to robot policies trained only on robot data. Our experiments in both simulation and real-world environments show that our cross-embodiment skills successfully guide robots in selecting appropriate actions, even with unseen video prompts. The project website can be found at: https://kimhanjung.github.io/UniSkill.
