Fast 4D Mesh Generation with a Spatio-Temporal Attention Chain

Abstract

4D mesh generation has recently emerged as a powerful paradigm for recovering dynamic 3D structure from videos, but existing methods remain slow, computationally expensive, and difficult to scale to longer sequences. We introduce a training-free approach that accelerates 4D mesh generation while improving temporal correspondence quality. Our key observation is that temporal correspondences emerge inside a 4D backbone long before its generated meshes become visually accurate. We exploit this with a general framework we call the Spatio-Temporal Attention Chain, which propagates information across space and time. Starting from vertices on an anchor mesh, the chain maps vertices to latent tokens, follows temporal correspondences in latent space, and then recovers frame-specific vertices through latent-to-vertex attention. This design avoids expensive explicit matching while preserving anchor mesh details, thereby improving dynamic mesh geometry and temporal consistency. Compared to the state of the art, our method generates a 4D mesh in 9 seconds, a 13× speedup, while producing higher-quality results. Moreover, our approach scales to videos up to 16× longer without degrading mesh quality. Beyond generation, the improved correspondences enable competitive zero-shot performance on two downstream tasks: 2D object tracking and 4D tracking. We further show that our framework enables reliable camera estimation, a capability not supported by prior 4D mesh generation methods.
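The three hops of the chain (vertex to latent, latent to latent across time, latent to vertex) can be illustrated with a small NumPy sketch. This is only one possible reading of the abstract, not the authors' implementation. The function names, the feature shapes, the use of scaled dot-product softmax attention, and the idea of reading out positions as a weighted average of candidate vertices are all assumptions made for illustration; the abstract does not specify any of them.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def attention(queries, keys):
    """Row-stochastic attention weights from queries (n, d) to keys (m, d).

    Assumption: plain scaled dot-product attention stands in for whatever
    attention maps the 4D backbone actually exposes.
    """
    scale = np.sqrt(queries.shape[-1])
    return softmax(queries @ keys.T / scale, axis=-1)


def attention_chain(anchor_vertex_feats, anchor_latents,
                    frame_latents, frame_vertex_feats, frame_vertices):
    """Hypothetical three-hop chain from anchor vertices to a target frame.

    anchor_vertex_feats: (V, d)  features of anchor-mesh vertices
    anchor_latents:      (T, d)  latent tokens of the anchor frame
    frame_latents:       (T2, d) latent tokens of the target frame
    frame_vertex_feats:  (V2, d) features of candidate target-frame vertices
    frame_vertices:      (V2, 3) positions of those candidate vertices
    """
    vertex_to_latent = attention(anchor_vertex_feats, anchor_latents)  # (V, T)
    latent_to_latent = attention(anchor_latents, frame_latents)        # (T, T2)
    latent_to_vertex = attention(frame_latents, frame_vertex_feats)    # (T2, V2)

    # Composing the hops gives a soft correspondence from every anchor vertex
    # to the target-frame vertices, with no explicit matching step.
    chain = vertex_to_latent @ latent_to_latent @ latent_to_vertex     # (V, V2)

    # Each anchor vertex is placed at the attention-weighted average of the
    # target-frame positions (an assumed readout).
    return chain @ frame_vertices, chain
```

Because each hop is row-stochastic, the composed matrix is too, so in this sketch every propagated vertex is a convex combination of the target-frame candidates. Keeping the anchor mesh's connectivity and moving only its vertices is one way the anchor's detail could be preserved across frames, though the abstract does not say how the method achieves this.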