Fast-FoundationStereo: Real-Time Zero-Shot Stereo Matching
December 11, 2025
Authors: Bowen Wen, Shaurya Dewan, Stan Birchfield
cs.AI
Abstract
Stereo foundation models achieve strong zero-shot generalization but remain computationally prohibitive for real-time applications. Efficient stereo architectures, on the other hand, sacrifice robustness for speed and require costly per-domain fine-tuning. To bridge this gap, we present Fast-FoundationStereo, a family of architectures that achieve, for the first time, strong zero-shot generalization at real-time frame rates. We employ a divide-and-conquer acceleration strategy with three components: (1) knowledge distillation to compress the hybrid backbone into a single efficient student; (2) blockwise neural architecture search for automatically discovering optimal cost filtering designs under latency budgets, reducing search complexity exponentially; and (3) structured pruning for eliminating redundancy in the iterative refinement module. Furthermore, we introduce an automatic pseudo-labeling pipeline used to curate 1.4M in-the-wild stereo pairs to supplement synthetic training data and facilitate knowledge distillation. The resulting model runs over 10x faster than FoundationStereo while closely matching its zero-shot accuracy, thus establishing a new state-of-the-art among real-time methods. Project page: https://nvlabs.github.io/Fast-FoundationStereo/
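To illustrate the exponential-reduction claim behind blockwise search, here is a minimal, hypothetical sketch (not the paper's implementation, and the candidate operators, latencies, and error values are invented): a joint search over all cost-filtering blocks must evaluate the product of per-block choices, while a blockwise search that scores each block independently under a per-block share of the latency budget evaluates only the sum.

```python
# Hypothetical toy example: each cost-filtering block can use one of several
# candidate ops, each with an (invented) latency in ms and proxy error score.
from itertools import product

blocks = [
    [("conv3d", 4.0, 0.10), ("sep3d", 2.5, 0.12), ("skip", 0.5, 0.30)],
    [("conv3d", 4.0, 0.08), ("sep3d", 2.5, 0.11), ("skip", 0.5, 0.25)],
    [("conv3d", 4.0, 0.05), ("sep3d", 2.5, 0.07), ("skip", 0.5, 0.20)],
]
budget_ms = 8.0

# Joint (exhaustive) search: |choices|^num_blocks combinations must be scored.
joint = min(
    (c for c in product(*blocks) if sum(op[1] for op in c) <= budget_ms),
    key=lambda c: sum(op[2] for op in c),
)

# Blockwise search: each block independently picks its lowest-error op that
# fits an even share of the budget -- sum(|choices|) evaluations instead of
# a product, i.e. linear rather than exponential in the number of blocks.
share = budget_ms / len(blocks)
blockwise = [
    min((op for op in b if op[1] <= share), key=lambda op: op[2])
    for b in blocks
]
```

In this toy instance the blockwise selection stays within the latency budget and matches the jointly optimal error, though in general blockwise search trades some optimality for the drastically smaller search space.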