Video Depth without Video Models

Abstract

Video depth estimation lifts monocular video clips to 3D by inferring dense depth at every frame. Recent advances in single-image depth estimation, brought about by the rise of large foundation models and the use of synthetic training data, have fueled a renewed interest in video depth. However, naively applying a single-image depth estimator to every frame of a video disregards temporal continuity, which not only leads to flickering but may also break when camera motion causes sudden changes in depth range. An obvious and principled solution would be to build on top of video foundation models, but these come with their own limitations, including expensive training and inference, imperfect 3D consistency, and stitching routines for the fixed-length (short) outputs. We take a step back and demonstrate how to turn a single-image latent diffusion model (LDM) into a state-of-the-art video depth estimator. Our model, which we call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator that is derived from a single-image LDM and maps very short video snippets (typically frame triplets) to depth snippets; and (ii) a robust, optimization-based registration algorithm that optimally assembles depth snippets sampled at various frame rates back into a consistent video. RollingDepth efficiently handles long videos with hundreds of frames and delivers more accurate depth videos than both dedicated video depth estimators and high-performing single-frame models. Project page: rollingdepth.github.io.
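To make ingredient (ii) more concrete, the following is a minimal sketch of how depth snippets sampled at several frame spacings could be registered into one video. It is not the RollingDepth implementation: the abstract does not specify the objective, so this sketch assumes that each snippet is known only up to one scale and one shift, and it swaps in ordinary least squares for the robust optimization the abstract mentions. The function names `make_snippets`, `align_snippets`, and `merge`, and the synthetic data, are all hypothetical.

```python
import numpy as np


def make_snippets(num_frames, dilations=(1, 2), length=3):
    """Enumerate frame-index tuples (snippets) at several frame spacings."""
    snippets = []
    for d in dilations:
        span = (length - 1) * d
        for start in range(num_frames - span):
            snippets.append(tuple(start + k * d for k in range(length)))
    return snippets


def align_snippets(snippets, depths):
    """Estimate one (scale, shift) per snippet so that shared frames agree.

    depths[k] has shape (length, H, W) and belongs to snippets[k].
    Snippet 0 is held at scale 1, shift 0 to pin down the global ambiguity.
    Assumption: plain least squares, not a robust loss.
    """
    n = len(snippets)
    # Record which (snippet, position) pairs observe each frame.
    by_frame = {}
    for k, frames in enumerate(snippets):
        for j, f in enumerate(frames):
            by_frame.setdefault(f, []).append((k, j))

    # One row per pixel and per pair of snippets that share a frame:
    # s_a * d_a + t_a - s_b * d_b - t_b = 0
    blocks = []
    for entries in by_frame.values():
        for a in range(len(entries)):
            for b in range(a + 1, len(entries)):
                (ka, ja), (kb, jb) = entries[a], entries[b]
                da = depths[ka][ja].ravel()
                db = depths[kb][jb].ravel()
                block = np.zeros((da.size, 2 * n))
                block[:, 2 * ka] = da
                block[:, 2 * ka + 1] = 1.0
                block[:, 2 * kb] = -db
                block[:, 2 * kb + 1] = -1.0
                blocks.append(block)
    A = np.vstack(blocks)

    # Move the fixed snippet 0 (scale 1, shift 0) to the right-hand side.
    rhs = -A[:, 0]
    sol, *_ = np.linalg.lstsq(A[:, 2:], rhs, rcond=None)
    return np.concatenate([[1.0, 0.0], sol]).reshape(n, 2)


def merge(snippets, depths, params, num_frames):
    """Average all aligned predictions that cover the same frame."""
    h, w = depths[0].shape[1:]
    total = np.zeros((num_frames, h, w))
    count = np.zeros(num_frames)
    for frames, d, (s, t) in zip(snippets, depths, params):
        for j, f in enumerate(frames):
            total[f] += s * d[j] + t
            count[f] += 1
    return total / count[:, None, None]


# Synthetic check: corrupt a known depth video with a random scale and shift
# per snippet, then recover a consistent video.
rng = np.random.default_rng(0)
num_frames, h, w = 8, 4, 4
gt = rng.uniform(1.0, 10.0, size=(num_frames, h, w))
snippets = make_snippets(num_frames)
true_scale = rng.uniform(0.5, 2.0, size=len(snippets))
true_shift = rng.uniform(-1.0, 1.0, size=len(snippets))
true_scale[0], true_shift[0] = 1.0, 0.0
depths = [(gt[list(fr)] - t) / s
          for fr, s, t in zip(snippets, true_scale, true_shift)]

params = align_snippets(snippets, depths)
merged = merge(snippets, depths, params, num_frames)
print("max abs error:", np.abs(merged - gt).max())
```

Snippets with larger frame spacing tie distant frames together directly, which is one plausible reason to sample at several rates: consistency then does not have to propagate only through chains of adjacent frames. With noisy real predictions, a robust loss would be the natural replacement for the least-squares step used here.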
