Multi-view Pyramid Transformer: Look Coarser to See Broader
December 8, 2025
Authors: Gyeongjin Kang, Seungkwon Yang, Seungtae Nam, Younggeun Lee, Jungwoo Kim, Eunbyung Park
cs.AI
Abstract
We propose Multi-view Pyramid Transformer (MVP), a scalable multi-view transformer architecture that directly reconstructs large 3D scenes from tens to hundreds of images in a single forward pass. Drawing on the idea of "looking broader to see the whole, looking finer to see the details," MVP is built on two core design principles: 1) a local-to-global inter-view hierarchy that gradually broadens the model's perspective from local views to groups and ultimately the full scene, and 2) a fine-to-coarse intra-view hierarchy that starts from detailed spatial representations and progressively aggregates them into compact, information-dense tokens. This dual hierarchy achieves both computational efficiency and representational richness, enabling fast reconstruction of large and complex scenes. We validate MVP on diverse datasets and show that, when coupled with 3D Gaussian Splatting as the underlying 3D representation, it achieves state-of-the-art generalizable reconstruction quality while maintaining high efficiency and scalability across a wide range of view configurations.
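To make the dual hierarchy concrete, here is a minimal shape-level sketch of how the two pyramids could interact. This is a hypothetical illustration, not the paper's implementation: the pooling factor, group size, and number of stages are assumptions, and attention layers are omitted so only the token bookkeeping is shown. Each stage pools spatial tokens within a view (fine-to-coarse) and then merges neighboring views into larger groups (local-to-global), so the total token count attention must process stays bounded as the model's receptive field grows to the full scene.

```python
import numpy as np

def pool_tokens(tokens, factor=4):
    """Fine-to-coarse intra-view step (sketch): average-pool every
    `factor` spatial tokens into one coarser, denser token."""
    n, t, d = tokens.shape
    t = (t // factor) * factor  # drop any remainder tokens
    return tokens[:, :t].reshape(n, t // factor, factor, d).mean(axis=2)

def merge_views(tokens, group=4):
    """Local-to-global inter-view step (sketch): concatenate the tokens
    of `group` neighboring views so later layers attend over a wider
    context (view -> group -> scene)."""
    n, t, d = tokens.shape
    n = (n // group) * group  # drop any remainder views
    return tokens[:n].reshape(n // group, group * t, d)

# 64 input views, each with 256 spatial tokens of width 128 (assumed sizes)
x = np.random.randn(64, 256, 128)
for _ in range(3):      # three pyramid stages (assumed depth)
    x = pool_tokens(x)  # tokens per view shrink 4x
    x = merge_views(x)  # groups of views grow 4x
print(x.shape)          # -> (1, 256, 128): one scene-level token set
```

Note how the per-group token count stays at 256 throughout: the 4x intra-view pooling exactly offsets the 4x inter-view merging, which is one way a design like this can keep attention cost flat while the context expands from local views to the whole scene.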