TC4D: Trajectory-Conditioned Text-to-4D Generation

Abstract

Recent techniques for text-to-4D generation synthesize dynamic 3D scenes using supervision from pre-trained text-to-video models. However, existing representations for motion, such as deformation models or time-dependent neural representations, are limited in the amount of motion they can generate: they cannot synthesize motion extending far beyond the bounding box used for volume rendering. The lack of a more flexible motion model contributes to the gap in realism between 4D generation methods and recent, near-photorealistic video generation models. Here, we propose TC4D: trajectory-conditioned text-to-4D generation, which factors motion into global and local components. We represent the global motion of a scene's bounding box as a rigid transformation along a trajectory parameterized by a spline. We then learn local deformations that conform to the global trajectory using supervision from a text-to-video model. Our approach enables the synthesis of scenes animated along arbitrary trajectories, compositional scene generation, and significant improvements to the realism and amount of generated motion, which we evaluate qualitatively and through a user study. Video results can be viewed on our website: https://sherwinbahmani.github.io/tc4d.
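
To make the global/local factorization more concrete, the following is a minimal sketch of the global component only: a rigid transform that carries a point from the bounding box's local frame to world coordinates at a normalized time t along a spline. It is an illustration under stated assumptions, not the authors' implementation. The uniform Catmull-Rom spline, the z-up convention, the yaw-only rotation toward the tangent, and the names `catmull_rom`, `trajectory`, and `local_to_world` are all hypothetical choices made here. The learned local deformation is not modeled.

```python
import math

def catmull_rom(p0, p1, p2, p3, u):
    """Return (position, tangent) on a uniform Catmull-Rom segment from p1 to p2, u in [0, 1]."""
    pos, tan = [], []
    for a, b, c, d in zip(p0, p1, p2, p3):
        c1 = -a + c
        c2 = 2 * a - 5 * b + 4 * c - d
        c3 = -a + 3 * b - 3 * c + d
        pos.append(0.5 * (2 * b + c1 * u + c2 * u * u + c3 * u ** 3))
        tan.append(0.5 * (c1 + 2 * c2 * u + 3 * c3 * u * u))
    return pos, tan

def trajectory(ctrl, t):
    """Evaluate the spline through the control points at normalized time t in [0, 1]."""
    # Duplicate the endpoints so the curve passes through the first and last control points.
    pts = [ctrl[0]] + list(ctrl) + [ctrl[-1]]
    n_seg = len(ctrl) - 1
    s = min(max(t, 0.0), 1.0) * n_seg
    i = min(int(s), n_seg - 1)
    return catmull_rom(pts[i], pts[i + 1], pts[i + 2], pts[i + 3], s - i)

def local_to_world(x_local, ctrl, t):
    """Assumed rigid transform: rotate about z so local +x faces the tangent heading, then translate."""
    pos, tan = trajectory(ctrl, t)
    yaw = math.atan2(tan[1], tan[0])
    c, s = math.cos(yaw), math.sin(yaw)
    x, y, z = x_local
    return [c * x - s * y + pos[0], s * x + c * y + pos[1], z + pos[2]]
```

For example, `local_to_world((1.0, 0.0, 0.0), [(0, 0, 0), (0, 1, 0), (0, 2, 0)], 0.5)` places the box center at the middle control point and turns the local x-axis to point along the direction of travel. Because the scene content stays inside the moving box, the global displacement can be arbitrarily large without enlarging the volume that is rendered.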
