无需视频模型的视频深度
Video Depth without Video Models
November 28, 2024
作者: Bingxin Ke, Dominik Narnhofer, Shengyu Huang, Lei Ke, Torben Peters, Katerina Fragkiadaki, Anton Obukhov, Konrad Schindler
cs.AI
摘要
视频深度估计通过推断每一帧的密集深度,将单目视频剪辑提升到3D。最近单图深度估计的进展,由大型基础模型的兴起和合成训练数据的使用带来,引发了对视频深度的重新关注。然而,简单地将单图深度估计器应用于视频的每一帧会忽略时间连续性,不仅会导致闪烁,还可能在摄像机运动引起深度范围突变时出现问题。一个明显且合理的解决方案是基于视频基础模型构建,但这些模型也存在各自的局限性;包括昂贵的训练和推断、不完美的3D一致性,以及针对固定长度(短)输出的拼接例程。我们退一步,展示如何将单图潜扩散模型(LDM)转化为一流的视频深度估计器。我们的模型名为RollingDepth,主要包括两个要素:(i)源自单图LDM的多帧深度估计器,将非常短的视频片段(通常是帧三元组)映射到深度片段。 (ii)一个稳健的、基于优化的配准算法,将以不同帧率采样的深度片段最佳地组装回一致的视频。RollingDepth能够高效处理数百帧的长视频,并提供比专用视频深度估计器和性能优越的单帧模型更准确的深度视频。项目页面:rollingdepth.github.io。
English
Video depth estimation lifts monocular video clips to 3D by inferring dense
depth at every frame. Recent advances in single-image depth estimation, brought
about by the rise of large foundation models and the use of synthetic training
data, have fueled a renewed interest in video depth. However, naively applying
a single-image depth estimator to every frame of a video disregards temporal
continuity, which not only leads to flickering but may also break when camera
motion causes sudden changes in depth range. An obvious and principled solution
would be to build on top of video foundation models, but these come with their
own limitations; including expensive training and inference, imperfect 3D
consistency, and stitching routines for the fixed-length (short) outputs. We
take a step back and demonstrate how to turn a single-image latent diffusion
model (LDM) into a state-of-the-art video depth estimator. Our model, which we
call RollingDepth, has two main ingredients: (i) a multi-frame depth estimator
that is derived from a single-image LDM and maps very short video snippets
(typically frame triplets) to depth snippets. (ii) a robust, optimization-based
registration algorithm that optimally assembles depth snippets sampled at
various different frame rates back into a consistent video. RollingDepth is
able to efficiently handle long videos with hundreds of frames and delivers
more accurate depth videos than both dedicated video depth estimators and
high-performing single-frame models. Project page: rollingdepth.github.io.Summary
AI-Generated Summary