V-DPM: 4D Video Reconstruction with Dynamic Point Maps

Abstract

Powerful 3D representations such as DUSt3R's invariant point maps, which encode 3D shape and camera parameters, have significantly advanced feed-forward 3D reconstruction. While point maps assume static scenes, Dynamic Point Maps (DPMs) extend this concept to dynamic 3D content by additionally representing scene motion. However, existing DPMs are limited to image pairs and, like DUSt3R, require post-processing via optimization when more than two views are involved. We argue that DPMs are more useful when applied to videos and introduce V-DPM to demonstrate this. First, we show how to formulate DPMs for video input in a way that maximizes representational power, facilitates neural prediction, and enables reuse of pretrained models. Second, we implement these ideas on top of VGGT, a recent and powerful 3D reconstructor. Although VGGT was trained on static scenes, we show that a modest amount of synthetic data is sufficient to adapt it into an effective V-DPM predictor. Our approach achieves state-of-the-art performance in 3D and 4D reconstruction for dynamic scenes. In particular, unlike recent dynamic extensions of VGGT such as P3, DPMs recover not only dynamic depth but also the full 3D motion of every point in the scene.
