TUN3D: Towards Real-World Scene Understanding from Unposed Images

Abstract

Layout estimation and 3D object detection are two fundamental tasks in indoor scene understanding. When combined, they enable the creation of a compact yet semantically rich spatial representation of a scene. Existing approaches typically rely on point cloud input, which poses a major limitation since most consumer cameras lack depth sensors and visual-only data remains far more common. We address this issue with TUN3D, the first method that tackles joint layout estimation and 3D object detection in real scans, given multi-view images as input, and does not require ground-truth camera poses or depth supervision. Our approach builds on a lightweight sparse-convolutional backbone and employs two dedicated heads: one for 3D object detection and one for layout estimation, leveraging a novel and effective parametric wall representation. Extensive experiments show that TUN3D achieves state-of-the-art performance across three challenging scene understanding benchmarks: (i) using ground-truth point clouds, (ii) using posed images, and (iii) using unposed images. While performing on par with specialized 3D object detection methods, TUN3D significantly advances layout estimation, setting a new benchmark in holistic indoor scene understanding. Code is available at https://github.com/col14m/tun3d.
