DIMO: Diverse 3D Motion Generation for Arbitrary Objects

Abstract

We present DIMO, a generative approach that produces diverse 3D motions for arbitrary objects from a single image. The core idea is to leverage the rich priors in well-trained video models to extract common motion patterns and embed them into a shared low-dimensional latent space. Specifically, we first generate multiple videos of the same object with diverse motions. We then embed each motion into a latent vector and train a shared motion decoder to learn the distribution of motions, expressed in a structured and compact representation, namely neural key point trajectories. Canonical 3D Gaussians are then driven by these key points and fused to model geometry and appearance. At inference time, the learned latent space lets us instantly sample diverse 3D motions in a single forward pass, and it supports several applications, including 3D motion interpolation and language-guided motion generation. Our project page is available at https://linzhanm.github.io/dimo.
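The pipeline described above (latent vector, shared decoder, key point trajectories, key-point-driven points) can be illustrated with a toy sketch. This is not the authors' implementation: the linear decoder stands in for the neural motion decoder, the Gaussian distance weighting in `drive_points` is an assumed way of letting key points move canonical points, and every name and dimension (`LATENT_DIM`, `NUM_KEYPOINTS`, `NUM_FRAMES`, `make_decoder`, `interpolate`, `drive_points`) is hypothetical.

```python
import math
import random

LATENT_DIM = 4      # assumed size of a motion latent vector
NUM_KEYPOINTS = 3   # assumed number of key points
NUM_FRAMES = 5      # assumed number of frames per trajectory


def make_decoder(seed=0):
    """Build a toy *linear* decoder: latent -> per-frame 3D key point offsets.

    A stand-in for a learned neural decoder; the weights here are random.
    """
    rng = random.Random(seed)
    out_dim = NUM_FRAMES * NUM_KEYPOINTS * 3
    weights = [[rng.gauss(0.0, 0.1) for _ in range(LATENT_DIM)]
               for _ in range(out_dim)]

    def decode(z):
        flat = [sum(w * zi for w, zi in zip(row, z)) for row in weights]
        # Reshape the flat output to [frame][key point][xyz].
        return [
            [flat[(t * NUM_KEYPOINTS + k) * 3:(t * NUM_KEYPOINTS + k) * 3 + 3]
             for k in range(NUM_KEYPOINTS)]
            for t in range(NUM_FRAMES)
        ]

    return decode


def interpolate(z_a, z_b, alpha):
    """Linearly blend two latent vectors (alpha=0 gives z_a, alpha=1 gives z_b)."""
    return [(1.0 - alpha) * a + alpha * b for a, b in zip(z_a, z_b)]


def drive_points(points, rest_keypoints, offsets, sigma=0.5):
    """Move canonical points by a distance-weighted blend of key point offsets.

    Assumption: weights fall off as a Gaussian of the distance between a point
    and each key point's rest position, normalized to sum to one.
    """
    moved = []
    for p in points:
        w = [math.exp(-sum((pi - ki) ** 2 for pi, ki in zip(p, k))
                      / (2.0 * sigma ** 2))
             for k in rest_keypoints]
        total = sum(w) or 1.0  # guard against underflow to zero
        moved.append([
            p[d] + sum(wk * off[d] for wk, off in zip(w, offsets)) / total
            for d in range(3)
        ])
    return moved


# Sample two latents, interpolate between them, and decode in one pass.
decode = make_decoder()
rng = random.Random(1)
z_a = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_b = [rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
z_mid = interpolate(z_a, z_b, 0.5)
trajectory = decode(z_mid)

# Drive a few canonical points (stand-ins for Gaussian centers) per frame.
rest_keypoints = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
canonical_points = [[0.1, 0.1, 0.0], [0.9, 0.0, 0.1]]
frames = [drive_points(canonical_points, rest_keypoints, offsets)
          for offsets in trajectory]
```

Sampling a new motion in this sketch only requires drawing a latent vector and calling `decode` once, which mirrors the single-forward-pass sampling described in the abstract. Interpolating in latent space and decoding the result corresponds to the 3D motion interpolation application.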
