Fast 4D Mesh Generation with Spatio-Temporal Attention Chain

Abstract

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call the Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens, follows temporal correspondences in latent space, and recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency. Compared to the state of the art, our method generates a 4D mesh in 9 seconds, a 13× speedup, while producing higher-quality results. Moreover, our approach scales to videos up to 16× longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.
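
The three hops described above (vertex to latent, latent to latent across time, latent to vertex) can be pictured as a product of row-stochastic attention maps. The following is only a minimal sketch under stated assumptions, not the method's actual implementation: the abstract gives no formulas, so the dot-product softmax attention, the `scale` parameter, and all names (`attention_chain`, `vertex_feats`, `anchor_latents`, `frame_latents`, `point_feats`, `point_positions`) are hypothetical and chosen for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, scale=1.0):
    """Row-stochastic weights: softmax over scaled dot products (assumed form)."""
    return [
        softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])
        for q in queries
    ]


def matmul(a, b):
    """Plain dense matrix product on nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def attention_chain(vertex_feats, anchor_latents, frame_latents,
                    point_feats, point_positions, scale=1.0):
    """Chain three attention hops and read out per-vertex 3D positions.

    Hypothetical inputs: feature vectors for anchor vertices, latent tokens of
    the anchor frame and of a target frame, and candidate points in the target
    frame with their features and 3D positions.
    """
    v2l = attention(vertex_feats, anchor_latents, scale)   # vertex -> anchor latent
    l2l = attention(anchor_latents, frame_latents, scale)  # anchor latent -> frame latent
    l2v = attention(frame_latents, point_feats, scale)     # frame latent -> frame point
    chain = matmul(matmul(v2l, l2l), l2v)                  # rows still sum to 1
    return matmul(chain, point_positions)                  # weighted average of positions


# Toy usage: two anchor vertices tracked into one target frame.
positions = attention_chain(
    vertex_feats=[[1.0, 0.0], [0.0, 1.0]],
    anchor_latents=[[1.0, 0.0], [0.0, 1.0]],
    frame_latents=[[0.0, 1.0], [1.0, 0.0]],
    point_feats=[[1.0, 0.0], [0.0, 1.0]],
    point_positions=[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]],
    scale=50.0,  # a sharp softmax makes each hop close to a hard match
)
```

Because each hop is a soft assignment, no explicit vertex-to-vertex matching step appears anywhere in this sketch; the correspondence is implied by the composed weights. With a large `scale` the chain behaves like a hard lookup, and with `scale=0` every vertex collapses to the mean of the candidate positions.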