TUN3D: Towards Real-World Scene Understanding from Unposed Images

Abstract

Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. Together, they yield a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which is a major limitation: most consumer cameras lack depth sensors, and visual-only data remains far more common. We address this issue with TUN3D, the first method to perform joint layout estimation and 3D object detection on real scans from multi-view images alone, without requiring ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone with two dedicated heads, one for 3D object detection and one for layout estimation, and layout estimation leverages a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance on three challenging scene understanding benchmarks that take as input (i) ground-truth point clouds, (ii) posed images, and (iii) unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark for holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
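The abstract does not specify how the parametric wall representation is defined. As a rough illustration only, the sketch below assumes one common choice: a wall is a vertical rectangle described by the two endpoints of its footprint on the floor plane, the floor height, and the wall height. The `Wall` class and its methods (`length`, `normal`, `corners`, `signed_distance`) are hypothetical and are not taken from the TUN3D paper or repository.

```python
from __future__ import annotations

import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Wall:
    """Hypothetical parametric wall: a vertical rectangle standing on the floor.

    Assumption (not from the paper): six numbers describe a wall, namely the
    footprint segment (x1, y1) -> (x2, y2) on the floor, the floor height
    z_floor, and the wall height.
    """

    x1: float
    y1: float
    x2: float
    y2: float
    z_floor: float
    height: float

    def length(self) -> float:
        # Length of the footprint segment on the floor plane.
        return math.hypot(self.x2 - self.x1, self.y2 - self.y1)

    def normal(self) -> tuple[float, float, float]:
        # Horizontal unit normal of the wall plane (left side of p1 -> p2).
        seg_len = self.length()
        if seg_len == 0:
            raise ValueError("degenerate wall: footprint endpoints coincide")
        dx = (self.x2 - self.x1) / seg_len
        dy = (self.y2 - self.y1) / seg_len
        return (-dy, dx, 0.0)

    def corners(self) -> list[tuple[float, float, float]]:
        # Four 3D corners: bottom edge first, then top edge in reverse order.
        z0, z1 = self.z_floor, self.z_floor + self.height
        return [
            (self.x1, self.y1, z0),
            (self.x2, self.y2, z0),
            (self.x2, self.y2, z1),
            (self.x1, self.y1, z1),
        ]

    def signed_distance(self, p: tuple[float, float, float]) -> float:
        # Signed distance from point p to the (infinite) wall plane.
        nx, ny, _ = self.normal()
        return nx * (p[0] - self.x1) + ny * (p[1] - self.y1)


wall = Wall(0.0, 0.0, 4.0, 0.0, 0.0, 2.5)
print(wall.length(), wall.corners())
```

A compact parametrization like this needs only a handful of numbers per wall, which makes it easy to regress from a dedicated prediction head and to convert back into 3D geometry for evaluation; whether TUN3D uses these exact parameters should be checked against the paper and the linked repository.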
