INVE: Interactive Neural Video Editing

Abstract

We present Interactive Neural Video Editing (INVE), a real-time video editing solution that assists the editing process by consistently propagating sparse frame edits to the entire video clip. Our method is inspired by recent work on Layered Neural Atlas (LNA). LNA, however, suffers from two major drawbacks: (1) it is too slow for interactive editing, and (2) it offers insufficient support for some editing use cases, including direct frame editing and rigid texture tracking. To address these challenges, we adopt highly efficient network architectures powered by hash-grid encoding, which substantially improves processing speed. In addition, we learn bidirectional functions between the image and the atlas and introduce vectorized editing, which together enable a much greater variety of edits, both in the atlas and directly in the frames. Compared to LNA, INVE reduces learning and inference time by a factor of 5 and supports various video editing operations that LNA cannot. We demonstrate the advantages of INVE over LNA for interactive video editing through comprehensive quantitative and qualitative analysis. For video results, please see https://gabriel-huang.github.io/inve/
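
The idea of propagating a sparse frame edit through bidirectional image-atlas mappings can be illustrated with a minimal sketch. This is a toy under stated assumptions, not the INVE implementation: the learned mapping networks are replaced by a known per-frame translation, and the names OFFSETS, frame_to_atlas, atlas_to_frame, and propagate_edit are hypothetical. Representing the edit as a list of points is likewise only one possible reading of "vectorized editing".

```python
# Toy stand-in for the learned mappings: each frame t sees the atlas shifted by a
# known per-frame offset. In INVE these mappings are neural networks; the analytic
# translation here is only an assumption that keeps the sketch runnable.
OFFSETS = {0: (0.0, 0.0), 1: (3.0, 1.0), 2: (6.0, 2.0)}  # hypothetical motion


def frame_to_atlas(x, y, t):
    """Forward mapping: pixel (x, y) in frame t -> atlas coordinate (u, v)."""
    dx, dy = OFFSETS[t]
    return (x - dx, y - dy)


def atlas_to_frame(u, v, t):
    """Backward mapping: atlas coordinate (u, v) -> pixel (x, y) in frame t."""
    dx, dy = OFFSETS[t]
    return (u + dx, v + dy)


def propagate_edit(points, src_t):
    """Carry edit points drawn on frame src_t to every frame via the atlas."""
    # Step 1: lift the edit from the source frame into shared atlas space.
    atlas_points = [frame_to_atlas(x, y, src_t) for x, y in points]
    # Step 2: project the atlas-space edit back into each frame.
    return {
        t: [atlas_to_frame(u, v, t) for u, v in atlas_points]
        for t in OFFSETS
    }


stroke = [(10.0, 20.0), (12.0, 21.0)]  # a sparse edit drawn on frame 1
edits = propagate_edit(stroke, src_t=1)
```

Because the two mappings are inverses of each other in this toy, the edit returns unchanged on the source frame and lands at consistently shifted positions on the other frames. With learned mappings, this round-trip consistency would only hold approximately.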
