GEN3C: 3D-Informed World-Consistent Video Generation with Precise Camera Control

Abstract

We present GEN3C, a generative video model with precise Camera Control and temporal 3D Consistency. Prior video models already generate realistic videos, but they tend to leverage little 3D information, leading to inconsistencies, such as objects popping in and out of existence. Camera control, if implemented at all, is imprecise, because camera parameters are mere inputs to the neural network which must then infer how the video depends on the camera. In contrast, GEN3C is guided by a 3D cache: point clouds obtained by predicting the pixel-wise depth of seed images or previously generated frames. When generating the next frames, GEN3C is conditioned on the 2D renderings of the 3D cache with the new camera trajectory provided by the user. Crucially, this means that GEN3C neither has to remember what it previously generated nor does it have to infer the image structure from the camera pose. The model, instead, can focus all its generative power on previously unobserved regions, as well as advancing the scene state to the next frame. Our results demonstrate more precise camera control than prior work, as well as state-of-the-art results in sparse-view novel view synthesis, even in challenging settings such as driving scenes and monocular dynamic video. Results are best viewed in videos. Check out our webpage! https://research.nvidia.com/labs/toronto-ai/GEN3C/
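The 3D cache described above can be illustrated with a minimal NumPy sketch: a depth map is lifted into a point cloud, and that point cloud is rendered into a new camera view together with a mask marking pixels the cache does not cover. This is only an illustration under stated assumptions, not the authors' implementation. It assumes a pinhole camera with intrinsics `K`, 4x4 pose matrices, and nearest-pixel splatting with a z-buffer; the function names `unproject_depth` and `render_points` are hypothetical. The depth estimator and the video model that would consume the rendering are not shown.

```python
import numpy as np


def unproject_depth(depth, K, cam_to_world):
    """Lift a depth map (H, W) into world-space points of shape (H*W, 3)."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T        # camera-space rays with z = 1
    pts_cam = rays * depth.reshape(-1, 1)  # scale each ray by its depth
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    return pts_cam @ R.T + t


def render_points(points, colors, K, world_to_cam, h, w):
    """Splat points into a new view; return (image, mask) with mask False at holes."""
    R, t = world_to_cam[:3, :3], world_to_cam[:3, 3]
    pts_cam = points @ R.T + t
    z = pts_cam[:, 2]
    front = z > 1e-6                       # keep points in front of the camera
    pts_cam, z, colors = pts_cam[front], z[front], colors[front]
    proj = pts_cam @ K.T
    u = np.round(proj[:, 0] / z).astype(int)
    v = np.round(proj[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    # Z-buffer: sort by pixel, then by depth, and keep the nearest point per pixel.
    flat = v * w + u
    order = np.lexsort((z, flat))
    flat_sorted = flat[order]
    first = np.ones(len(flat_sorted), dtype=bool)
    first[1:] = flat_sorted[1:] != flat_sorted[:-1]
    keep = order[first]

    image = np.zeros((h * w, 3), dtype=colors.dtype)
    mask = np.zeros(h * w, dtype=bool)
    image[flat[keep]] = colors[keep]
    mask[flat[keep]] = True
    return image.reshape(h, w, 3), mask.reshape(h, w)
```

In this sketch, pixels where the mask is `False` correspond to regions the cache has never observed, which is where, per the abstract, the model can focus its generative power.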