MoZoo: Unleashing the Power of Video Diffusion in Animal Fur and Muscle Simulation

Abstract

Creating cinematic-quality animal effects requires precise modeling of muscle and fur dynamics, a process that remains both labor-intensive and computationally expensive in traditional production workflows. Although generative diffusion models have shown promise in diverse artistic workflows, their capacity for high-fidelity animal simulation remains largely unexplored. We present MoZoo, a generative dynamics solver that bypasses conventional refinement and synthesizes high-fidelity animal videos from coarse meshes under multimodal guidance. We propose Role-Aware RoPE (RAR-RoPE), which uses role-based index remapping to synchronize motion alignment while decoupling reference information through fixed temporal offsets. Complementing this, Asymmetric Decoupled Attention partitions the latent sequence to enforce a unidirectional information flow, which prevents feature interference and improves computational efficiency. To address the scarcity of high-quality training data, we introduce MoZoo-Data, a synthetic-to-real pipeline that uses a rendering engine and an inverse mapping approach to construct a large-scale dataset of paired sequences. Furthermore, we establish MoZooBench, a comprehensive benchmark of 120 mesh-video pairs. Experimental results show that MoZoo achieves high-fidelity fur simulation across diverse animal skeletons and layouts while maintaining strong temporal and structural consistency.
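The two mechanisms named in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the abstract does not list the token roles, the offset value, or the direction of the attention restriction, so the role names (`target`, `mesh`, `reference`), the constant `REF_OFFSET`, and both helper functions below are hypothetical. The sketch assumes that mesh and target tokens of the same frame share one temporal index (motion alignment), that reference tokens are shifted by a fixed offset (decoupling), and that condition tokens never attend to target tokens (one-way flow from conditions to the generated video).

```python
from typing import List, Tuple

# Hypothetical role names; the abstract does not enumerate the roles.
TARGET, MESH, REFERENCE = "target", "mesh", "reference"

# Assumed fixed temporal offset that pushes reference tokens away from
# the frame indices used by the video tokens. The value is illustrative.
REF_OFFSET = 1000

Token = Tuple[str, int]  # (role, frame number within that role's stream)


def remap_temporal_indices(tokens: List[Token]) -> List[int]:
    """Return the temporal index each token would pass to RoPE."""
    indices = []
    for role, frame in tokens:
        if role in (TARGET, MESH):
            # Same frame -> same index, so mesh motion and target
            # video frames are aligned in time.
            indices.append(frame)
        elif role == REFERENCE:
            # Fixed offset keeps reference tokens outside the video's
            # temporal range.
            indices.append(REF_OFFSET + frame)
        else:
            raise ValueError(f"unknown role: {role}")
    return indices


def asymmetric_mask(tokens: List[Token]) -> List[List[bool]]:
    """mask[q][k] is True when query token q may attend to key token k."""
    mask = []
    for q_role, _ in tokens:
        if q_role == TARGET:
            # Target queries read from every token.
            row = [True for _ in tokens]
        else:
            # Condition queries see only other condition tokens, so no
            # information flows back from the target partition.
            row = [k_role != TARGET for k_role, _ in tokens]
        mask.append(row)
    return mask


tokens = [(REFERENCE, 0), (MESH, 0), (MESH, 1), (TARGET, 0), (TARGET, 1)]
positions = remap_temporal_indices(tokens)
mask = asymmetric_mask(tokens)
```

Under these assumptions, the condition tokens' features do not depend on the noisy target tokens, so they could in principle be computed once and reused across denoising steps. That is one plausible reading of the efficiency claim, not something the abstract states.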