π^3: Scalable Permutation-Equivariant Visual Geometry Learning
July 17, 2025
Authors: Yifan Wang, Jianjun Zhou, Haoyi Zhu, Wenzheng Chang, Yang Zhou, Zizun Li, Junyi Chen, Jiangmiao Pang, Chunhua Shen, Tong He
cs.AI
Abstract
We introduce π^3, a feed-forward neural network that offers a novel
approach to visual geometry reconstruction, breaking the reliance on a
conventional fixed reference view. Previous methods often anchor their
reconstructions to a designated viewpoint, an inductive bias that can lead to
instability and failures if the reference is suboptimal. In contrast, π^3
employs a fully permutation-equivariant architecture to predict
affine-invariant camera poses and scale-invariant local point maps without any
reference frames. This design makes our model inherently robust to input
ordering and highly scalable. These advantages enable our simple and bias-free
approach to achieve state-of-the-art performance on a wide range of tasks,
including camera pose estimation, monocular/video depth estimation, and dense
point map reconstruction. Code and models are publicly available.
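
For clarity, the permutation-equivariance property the abstract refers to can be stated formally; the notation below ($x_i$, $\sigma$, $f$, $y_i$) is ours and not taken from the paper. For an input sequence of $N$ views $X = (x_1, \dots, x_N)$ and any permutation $\sigma$ of $\{1, \dots, N\}$, a permutation-equivariant predictor $f$ satisfies

$$ f\big(x_{\sigma(1)}, \dots, x_{\sigma(N)}\big) = \big(y_{\sigma(1)}, \dots, y_{\sigma(N)}\big), \qquad \text{where } (y_1, \dots, y_N) = f(x_1, \dots, x_N), $$

i.e., reordering the input views only reorders the per-view outputs (here, camera poses and local point maps) without changing their values, in contrast to reference-view methods whose reconstruction depends on which view is designated as the anchor.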