
TUN3D: Towards Real-World Scene Understanding from Unposed Images

September 23, 2025
作者: Anton Konushin, Nikita Drozdov, Bulat Gabdullin, Alexey Zakharov, Anna Vorontsova, Danila Rukhovich, Maksim Kolodiazhnyi
cs.AI

Abstract

Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
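The abstract mentions a "parametric wall representation" without defining it. As a rough illustration only, here is a minimal sketch of one plausible parameterization (a 2D floor-plan segment plus a height); the class name, fields, and geometry below are assumptions for illustration, not the paper's actual formulation:

```python
from dataclasses import dataclass
import math


@dataclass
class ParametricWall:
    """Hypothetical wall parameterization: a 2D segment on the floor plan
    (x1, y1) -> (x2, y2), extruded vertically by `height`."""
    x1: float
    y1: float
    x2: float
    y2: float
    height: float

    def length(self) -> float:
        # Euclidean length of the wall's floor-plan segment.
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    def corners(self):
        # The four 3D corners of the wall quad (counter-clockwise).
        return [
            (self.x1, self.y1, 0.0),
            (self.x2, self.y2, 0.0),
            (self.x2, self.y2, self.height),
            (self.x1, self.y1, self.height),
        ]


# A 4 m wall, 2.5 m tall, along the x-axis.
wall = ParametricWall(0.0, 0.0, 4.0, 0.0, 2.5)
print(wall.length())  # → 4.0
```

Such a compact parameterization is what makes the layout head's output "compact yet semantically rich": each wall is a handful of scalars rather than a mesh or dense occupancy grid.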
September 29, 2025