iLRM: An Iterative Large 3D Reconstruction Model
July 31, 2025
Authors: Gyeongjin Kang, Seungtae Nam, Xiangyu Sun, Sameh Khamis, Abdelrahman Mohamed, Eunbyung Park
cs.AI
Abstract
Feed-forward 3D modeling has emerged as a promising approach for rapid and
high-quality 3D reconstruction. In particular, directly generating explicit 3D
representations, such as 3D Gaussian splatting, has attracted significant
attention due to its fast and high-quality rendering, as well as numerous
applications. However, many state-of-the-art methods, primarily based on
transformer architectures, suffer from severe scalability issues because they
rely on full attention across image tokens from multiple input views, resulting
in prohibitive computational costs as the number of views or image resolution
increases. Toward scalable and efficient feed-forward 3D reconstruction, we
introduce an iterative Large 3D Reconstruction Model (iLRM) that generates 3D
Gaussian representations through an iterative refinement mechanism, guided by
three core principles: (1) decoupling the scene representation from input-view
images to enable compact 3D representations; (2) decomposing fully-attentional
multi-view interactions into a two-stage attention scheme to reduce
computational costs; and (3) injecting high-resolution information at every
layer to achieve high-fidelity reconstruction. Experimental results on widely
used datasets, such as RE10K and DL3DV, demonstrate that iLRM outperforms
existing methods in both reconstruction quality and speed. Notably, iLRM
exhibits superior scalability, delivering significantly higher reconstruction
quality under comparable computational cost by efficiently leveraging a larger
number of input views.
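
To make the scalability argument concrete, the sketch below counts query-key pairs for full attention over all image tokens and compares it with one possible two-stage factorization. The abstract does not spell out what the two stages are, so the split used here (per-view cross-attention from a compact set of scene tokens to that view's image tokens, followed by self-attention among all scene tokens) is an assumption for illustration. The function names and the token counts are hypothetical as well.

```python
def full_attention_pairs(num_views: int, tokens_per_view: int) -> int:
    """Query-key pairs when every image token attends to every image token of every view."""
    total_tokens = num_views * tokens_per_view
    return total_tokens * total_tokens


def two_stage_pairs(num_views: int, tokens_per_view: int, scene_tokens_per_view: int) -> int:
    """Query-key pairs for a hypothetical two-stage scheme (assumed, not taken from the paper)."""
    # Stage 1 (assumed): each view's compact scene tokens cross-attend
    # only to the image tokens of that same view.
    per_view_pairs = num_views * scene_tokens_per_view * tokens_per_view
    # Stage 2 (assumed): all compact scene tokens self-attend across views.
    total_scene_tokens = num_views * scene_tokens_per_view
    return per_view_pairs + total_scene_tokens * total_scene_tokens


if __name__ == "__main__":
    # Illustrative sizes only: 1024 image tokens and 256 scene tokens per view.
    for views in (2, 8, 32):
        full = full_attention_pairs(views, tokens_per_view=1024)
        staged = two_stage_pairs(views, tokens_per_view=1024, scene_tokens_per_view=256)
        print(f"views={views:>2}  full={full:>13,}  two_stage={staged:>13,}  ratio={full / staged:.1f}x")
```

Under these assumed sizes, the full-attention count grows with the square of the total number of image tokens, while in the factorized count the quadratic term involves only the smaller set of scene tokens. This matches the kind of saving the abstract attributes to decoupling the scene representation from the input images.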