Learning Temporally Consistent Video Depth from Video Diffusion Priors

Abstract

This work addresses the challenge of video depth estimation, which requires not only per-frame accuracy but, more importantly, cross-frame consistency. Instead of developing a depth estimator from scratch, we reformulate the prediction task as a conditional generation problem. This allows us to leverage the prior knowledge embedded in existing video generation models, thereby reducing learning difficulty and enhancing generalizability. Concretely, we study how to tame the public Stable Video Diffusion (SVD) to predict reliable depth from input videos using a mixture of image depth and video depth datasets. We empirically confirm that a procedural training strategy (first optimizing the spatial layers of SVD, then optimizing the temporal layers while keeping the spatial layers frozen) yields the best results in terms of both spatial accuracy and temporal consistency. We further examine the sliding window strategy for inference on arbitrarily long videos. Our observations indicate a trade-off between efficiency and performance, with a one-frame overlap already producing favorable results. Extensive experimental results demonstrate the superiority of our approach, termed ChronoDepth, over existing alternatives, particularly in terms of the temporal consistency of the estimated depth. Additionally, we highlight the benefits of more consistent video depth in two practical applications: depth-conditioned video generation and novel view synthesis. Our project page is available at https://jhaoshao.github.io/ChronoDepth/.
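The sliding window strategy mentioned above can be illustrated with a small indexing sketch. The function name `sliding_windows` and its parameters are hypothetical and are not taken from the ChronoDepth code. The sketch only shows how fixed-size windows with a chosen overlap tile a long video; it does not show how the model uses the shared frames between neighbouring windows.

```python
from typing import List, Tuple


def sliding_windows(num_frames: int, window: int, overlap: int = 1) -> List[Tuple[int, int]]:
    """Return half-open [start, end) frame ranges that cover a video.

    Hypothetical helper for illustration: consecutive windows share
    `overlap` frames, and the last window may be shorter than `window`.
    """
    if num_frames <= 0:
        return []
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need window > 0 and 0 <= overlap < window")

    stride = window - overlap  # number of new frames each window contributes
    ranges: List[Tuple[int, int]] = []
    start = 0
    while True:
        end = min(start + window, num_frames)
        ranges.append((start, end))
        if end == num_frames:  # the video is fully covered
            break
        start += stride
    return ranges


# A 10-frame clip, 4-frame windows, one-frame overlap: three windows.
print(sliding_windows(10, 4, overlap=1))  # [(0, 4), (3, 7), (6, 10)]
# The same clip with a three-frame overlap needs seven windows.
print(len(sliding_windows(10, 4, overlap=3)))  # 7
```

The window count grows as the overlap grows, and each window corresponds to one pass through the model. This is the efficiency side of the trade-off described in the abstract.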
