3D Scene Understanding Through Local Random Access Sequence Modeling

Abstract

3D scene understanding from single images is a pivotal problem in computer vision with numerous downstream applications in graphics, augmented reality, and robotics. While diffusion-based modeling approaches have shown promise, they often struggle to maintain object and scene consistency, especially in complex real-world scenarios. To address these limitations, we propose an autoregressive generative approach called Local Random Access Sequence (LRAS) modeling, which uses local patch quantization and randomly ordered sequence generation. Using optical flow as an intermediate representation for 3D scene editing, we show experimentally that LRAS achieves state-of-the-art novel view synthesis and 3D object manipulation capabilities. Furthermore, we show that our framework naturally extends to self-supervised depth estimation through a simple modification of the sequence design. By achieving strong performance on multiple 3D scene understanding tasks, LRAS provides a unified and effective framework for building the next generation of 3D vision models.
