OmniRoam: World Wandering via Long-Horizon Panoramic Video Generation

Abstract

Modeling scenes with video generation models has attracted growing research interest in recent years. However, most existing approaches rely on perspective video models that synthesize only limited observations of a scene, which leads to problems with completeness and global consistency. We propose OmniRoam, a controllable panoramic video generation framework that exploits the rich per-frame scene coverage and the inherent long-term spatial and temporal consistency of panoramic representations to enable long-horizon scene wandering. The framework begins with a preview stage, in which a trajectory-controlled video generation model creates a quick overview of the scene from a given input image or video. Then, in the refine stage, this video is temporally extended and spatially upsampled to produce long-range, high-resolution videos, enabling high-fidelity world wandering. To train our model, we introduce two panoramic video datasets that incorporate both synthetic and real-world captured videos. Experiments show that our framework consistently outperforms state-of-the-art methods in visual quality, controllability, and long-term scene consistency, both qualitatively and quantitatively. We further showcase several extensions of this framework, including real-time video generation and 3D reconstruction. Code is available at https://github.com/yuhengliu02/OmniRoam.

