Learning Temporally Consistent Video Depth from Video Diffusion Priors
June 3, 2024
Authors: Jiahao Shao, Yuanbo Yang, Hongyu Zhou, Youmin Zhang, Yujun Shen, Matteo Poggi, Yiyi Liao
cs.AI
Abstract
This work addresses the challenge of video depth estimation, which requires
not only per-frame accuracy but, more importantly, cross-frame consistency.
Instead of directly developing a depth estimator from scratch, we reformulate
the prediction task into a conditional generation problem. This allows us to
leverage the prior knowledge embedded in existing video generation models,
thereby reducing learning difficulty and enhancing generalizability.
Concretely, we study how to tame the public Stable Video Diffusion (SVD) to
predict reliable depth from input videos using a mixture of image depth and
video depth datasets. We empirically confirm that a procedural training
strategy - first optimizing the spatial layers of SVD and then optimizing the
temporal layers while keeping the spatial layers frozen - yields the best
results in terms of both spatial accuracy and temporal consistency. We further
examine the sliding window strategy for inference on arbitrarily long videos.
Our observations indicate a trade-off between efficiency and performance, with
a one-frame overlap already producing favorable results. Extensive experimental
results demonstrate the superiority of our approach, termed ChronoDepth, over
existing alternatives, particularly in terms of the temporal consistency of the
estimated depth. Additionally, we highlight the benefits of more consistent
video depth in two practical applications: depth-conditioned video generation
and novel view synthesis. Our project page is available at
https://jhaoshao.github.io/ChronoDepth/.
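
The sliding window strategy mentioned in the abstract can be illustrated with a small sketch of the index bookkeeping. This is not the authors' implementation: the function name `sliding_windows`, the window length, and the choice of end-exclusive ranges are assumptions made for illustration, and the abstract does not say how the overlapping frame is used to link consecutive clips. The sketch only shows how a long video might be split into fixed-length clips that share `overlap` frames, with a one-frame overlap as the default.

```python
def sliding_windows(num_frames, window, overlap=1):
    """Split frame indices 0..num_frames-1 into overlapping clips.

    Returns a list of (start, end) pairs with `end` exclusive.
    Consecutive clips share `overlap` frames; the last clip may be shorter.
    All names and defaults here are illustrative assumptions.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    if not 0 <= overlap < window:
        raise ValueError("overlap must satisfy 0 <= overlap < window")
    if num_frames <= 0:
        return []

    stride = window - overlap  # number of new frames each clip contributes
    windows = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break
        start += stride
    return windows


if __name__ == "__main__":
    # A 10-frame video, 4-frame clips, one shared frame between neighbours.
    print(sliding_windows(10, 4, overlap=1))
```

A larger overlap means a smaller stride and therefore more clips to process for the same video, which is one way to read the efficiency and performance trade-off the abstract describes.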