TUN3D: Towards Real-World Scene Understanding from Unposed Images

September 23, 2025
Authors: Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi
cs.AI

Abstract

Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
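
To make the shared-backbone, two-head design described in the abstract concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' released code). It uses dense 3D convolutions as a stand-in for the sparse-convolutional backbone, an illustrative 18-class detection head predicting 7-DoF boxes, and a made-up 6-parameter wall encoding (two 2D endpoints, a height, and an existence score); the actual TUN3D parameterization may differ, so refer to the repository linked above for the real implementation.

```python
# Hypothetical sketch of a two-head indoor scene-understanding model.
# Assumptions (not from the paper): dense Conv3d backbone instead of sparse
# convolutions, 18 object classes, 7-DoF boxes, 6-parameter wall encoding.
import torch
import torch.nn as nn


class TwoHeadSceneModel(nn.Module):
    """Shared 3D backbone feeding a detection head and a layout head."""

    def __init__(self, in_channels: int = 3, feat: int = 64,
                 num_classes: int = 18, wall_params: int = 6):
        super().__init__()
        # Stand-in backbone: two strided 3D conv blocks (real model is sparse).
        self.backbone = nn.Sequential(
            nn.Conv3d(in_channels, feat, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(feat), nn.ReLU(inplace=True),
            nn.Conv3d(feat, feat, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm3d(feat), nn.ReLU(inplace=True),
        )
        # Detection head: per-cell class logits + 7-DoF box (center, size, yaw).
        self.det_head = nn.Conv3d(feat, num_classes + 7, kernel_size=1)
        # Layout head: per-cell parametric wall (x1, y1, x2, y2, height, score).
        self.layout_head = nn.Conv3d(feat, wall_params, kernel_size=1)

    def forward(self, voxels: torch.Tensor):
        features = self.backbone(voxels)
        return self.det_head(features), self.layout_head(features)


if __name__ == "__main__":
    # Toy voxelized scene: batch of 1, 3 feature channels on a 64^3 grid.
    scene = torch.randn(1, 3, 64, 64, 64)
    det, layout = TwoHeadSceneModel()(scene)
    print(det.shape, layout.shape)  # [1, 25, 16, 16, 16], [1, 6, 16, 16, 16]
```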