Streetscapes: Large-scale Consistent Street View Generation Using Autoregressive Video Diffusion
July 18, 2024
Authors: Boyang Deng, Richard Tucker, Zhengqi Li, Leonidas Guibas, Noah Snavely, Gordon Wetzstein
cs.AI
Abstract
We present a method for generating Streetscapes: long sequences of views through an on-the-fly synthesized city-scale scene. Our generation is conditioned on language input (e.g., city name, weather), as well as an underlying map/layout hosting the desired trajectory. Compared to recent models for video generation or 3D view synthesis, our method can scale to much longer-range camera trajectories, spanning several city blocks, while maintaining visual quality and consistency. To achieve this goal, we build on recent work on video diffusion, used within an autoregressive framework that can easily scale to long sequences. In particular, we introduce a new temporal imputation method that prevents our autoregressive approach from drifting from the distribution of realistic city imagery. We train our Streetscapes system on a compelling source of data: posed imagery from Google Street View, along with contextual map data. This allows users to generate city views conditioned on any desired city layout, with controllable camera poses. Please see more results at our project page: https://boyangdeng.com/streetscapes.
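To illustrate the autoregressive rollout with temporal imputation described above, here is a minimal Python/PyTorch sketch. It is not the authors' implementation; the names (`denoiser`, `sample_chunk`, `generate_streetscape`), the Euler sampler, the noise schedule, and the frame/overlap sizes are all illustrative assumptions. The key idea shown is one plausible reading of temporal imputation: while denoising each new chunk, the frames that overlap with already-generated footage are repeatedly replaced by re-noised copies of those generated frames, anchoring the chunk to the existing trajectory and limiting drift.

```python
import torch

def sample_chunk(denoiser, cond, prev_frames, chunk_len, num_steps, sigmas):
    """Sample one chunk of frames; the first len(prev_frames) frames are
    imputed from already-generated frames at every denoising step.
    (Hypothetical sketch; shapes and sampler are illustrative.)"""
    overlap = prev_frames.shape[0]
    x = torch.randn(chunk_len, 3, 256, 256) * sigmas[0]  # start from pure noise
    for i in range(num_steps):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Temporal imputation (assumed mechanism): overwrite the overlapping
        # frames with previously generated frames, re-noised to the current
        # noise level, so the new chunk stays consistent with prior output.
        x[:overlap] = prev_frames + sigma * torch.randn_like(prev_frames)
        # One Euler step on the probability-flow ODE.
        denoised = denoiser(x, sigma, cond)
        x = x + (sigma_next - sigma) * (x - denoised) / sigma
    x[:overlap] = prev_frames  # exact agreement on the overlap at the end
    return x

def generate_streetscape(denoiser, conds, chunk_len=8, overlap=2, num_steps=50):
    """Autoregressively roll out a long street-view sequence chunk by chunk.
    `conds` holds per-chunk conditioning (e.g., text plus map/camera layout)."""
    sigmas = torch.linspace(80.0, 0.0, num_steps + 1)  # example noise schedule
    first = sample_chunk(denoiser, conds[0], torch.empty(0, 3, 256, 256),
                         chunk_len, num_steps, sigmas)
    video = [first]
    for cond in conds[1:]:
        prev = video[-1][-overlap:]  # condition on the tail of the last chunk
        chunk = sample_chunk(denoiser, cond, prev, chunk_len, num_steps, sigmas)
        video.append(chunk[overlap:])  # drop the duplicated overlap frames
    return torch.cat(video, dim=0)
```

Without the imputation step, each chunk would be only loosely tied to its predecessor, and small errors would compound over many chunks; re-injecting the generated frames at every noise level is what keeps a multi-block trajectory on the manifold of realistic city imagery.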