Multi-view Pyramid Transformer: Look Coarser to See Broader
December 8, 2025
Authors: Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park
cs.AI
Abstract
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
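The dual hierarchy described above can be illustrated with a minimal token-bookkeeping sketch: each pyramid level pools the spatial tokens within every view (fine to coarse) and then merges neighboring views into larger groups (local to global), so the token budget stays bounded as the receptive scope grows toward the full scene. All function names, shapes, and the pooling/grouping choices below are assumptions for illustration only, not the paper's actual implementation (which uses learned transformer layers rather than mean pooling).

```python
# Hypothetical sketch of MVP's dual hierarchy; names and operations are
# illustrative assumptions, not the paper's architecture.
import numpy as np

def intra_view_coarsen(tokens, factor=2):
    """Fine-to-coarse intra-view step: pool spatial tokens into fewer,
    more information-dense tokens. tokens: (num_tokens, dim)."""
    n, d = tokens.shape
    n_out = n // factor
    return tokens[: n_out * factor].reshape(n_out, factor, d).mean(axis=1)

def inter_view_broaden(view_tokens, group_size=2):
    """Local-to-global inter-view step: merge neighboring views (or groups)
    into larger groups by concatenating their token sets."""
    return [
        np.concatenate(view_tokens[i : i + group_size], axis=0)
        for i in range(0, len(view_tokens), group_size)
    ]

def mvp_pyramid(view_tokens, levels=2):
    """Alternate coarsening and broadening until a single scene-level
    token set remains (or the level budget is exhausted)."""
    current = view_tokens
    for _ in range(levels):
        current = [intra_view_coarsen(t) for t in current]  # fine -> coarse
        current = inter_view_broaden(current)               # local -> global
        if len(current) == 1:
            break
    return current

# 4 input views, each with 16 spatial tokens of dimension 8.
views = [np.random.randn(16, 8) for _ in range(4)]
scene = mvp_pyramid(views, levels=2)
```

After two levels, the four views collapse into one scene-level group whose token count equals a single view's original budget: coarsening halves tokens per group while broadening doubles group size, which is the efficiency/richness trade-off the abstract describes.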