

Efficiently Reconstructing Dynamic Scenes One D4RT at a Time

December 9, 2025
Authors: Chuhan Zhang, Guillaume Le Moing, Skanda Koppula, Ignacio Rocco, Liliane Momeni, Junyu Xie, Shuyang Sun, Rahul Sukthankar, Joëlle K. Barral, Raia Hadsell, Zoubin Ghahramani, Andrew Zisserman, Junlin Zhang, Mehdi S. M. Sajjadi
cs.AI

Abstract

Understanding and reconstructing the complex geometry and motion of dynamic scenes from video remains a formidable challenge in computer vision. This paper introduces D4RT, a simple yet powerful feedforward model designed to efficiently solve this task. D4RT utilizes a unified transformer architecture to jointly infer depth, spatio-temporal correspondence, and full camera parameters from a single video. Its core innovation is a novel querying mechanism that sidesteps the heavy computation of dense, per-frame decoding and the complexity of managing multiple, task-specific decoders. Our decoding interface allows the model to independently and flexibly probe the 3D position of any point in space and time. The result is a lightweight and highly scalable method that enables remarkably efficient training and inference. We demonstrate that our approach sets a new state of the art, outperforming previous methods across a wide spectrum of 4D reconstruction tasks. We refer the reader to the project webpage for animated results: https://d4rt-paper.github.io/.
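
To make the query-based decoding idea concrete, below is a minimal, hypothetical sketch of such an interface: each query, given as a normalized pixel location plus a source time and a target time, cross-attends to the video encoder's tokens and is regressed to a 3D position. The class name PointQueryDecoder, the (u, v, t_src, t_tgt) query format, and all dimensions are illustrative assumptions, not the authors' implementation.

    # Hypothetical sketch of a query-based decoding interface, in the spirit of
    # the abstract: instead of densely decoding every pixel of every frame, the
    # model is probed with individual space-time queries. Names, dimensions,
    # and the query format are assumptions for illustration only.
    import torch
    import torch.nn as nn


    class PointQueryDecoder(nn.Module):
        """Decodes the 3D position of arbitrary space-time query points
        by cross-attending to encoded video tokens."""

        def __init__(self, dim: int = 256, num_heads: int = 8):
            super().__init__()
            # Embed a (u, v, t_src, t_tgt) query into the token dimension.
            self.query_embed = nn.Sequential(
                nn.Linear(4, dim), nn.GELU(), nn.Linear(dim, dim)
            )
            self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            # Regress a 3D point from the attended query representation.
            self.head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 3))

        def forward(self, video_tokens: torch.Tensor, queries: torch.Tensor) -> torch.Tensor:
            # video_tokens: (B, N, dim) tokens from a video transformer encoder.
            # queries:      (B, Q, 4) normalized (u, v, t_src, t_tgt) probes.
            q = self.query_embed(queries)                      # (B, Q, dim)
            attended, _ = self.cross_attn(q, video_tokens, video_tokens)
            return self.head(attended)                         # (B, Q, 3)


    if __name__ == "__main__":
        decoder = PointQueryDecoder()
        tokens = torch.randn(2, 1024, 256)   # stand-in for encoder output
        queries = torch.rand(2, 64, 4)       # 64 arbitrary space-time probes
        points = decoder(tokens, queries)
        print(points.shape)                  # torch.Size([2, 64, 3])

Because each query is decoded independently, the cost scales with the number of probed points rather than with dense per-frame outputs, which is the efficiency argument the abstract makes for this decoding interface.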