Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

Abstract

Recent advances in stereo matching have achieved remarkable accuracy, but they often rely on large models, heavy computation, or additional foundation-model priors, which makes them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both the architecture and the training perspective. On the architecture side, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework that is optimized for real inference latency rather than for theoretical MACs alone. On the training side, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, which enable smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at https://tomtomtommi.github.io/LiteAnyStereoV2/.
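The abstract names pseudo-label filtering and error clamping but does not state the exact rules, so the following is only a minimal sketch of how such a loss term could look, not the authors' implementation. It assumes a per-pixel teacher confidence score, a confidence threshold, and a fixed cap on the absolute disparity error. The function name `clamped_pseudo_label_loss` and the parameters `conf_thresh` and `clamp_max` are hypothetical, and plain Python lists stand in for the tensors a real training loop would use.

```python
import math
from typing import Sequence


def clamped_pseudo_label_loss(
    pred: Sequence[float],
    pseudo: Sequence[float],
    confidence: Sequence[float],
    conf_thresh: float = 0.5,   # hypothetical threshold, not from the paper
    clamp_max: float = 5.0,     # hypothetical error cap in pixels, not from the paper
) -> float:
    """Mean absolute disparity error over trusted pixels, with each error capped."""
    if not (len(pred) == len(pseudo) == len(confidence)):
        raise ValueError("all inputs must have the same number of pixels")

    errors = []
    for p, t, c in zip(pred, pseudo, confidence):
        # Filtering (assumed rule): skip pixels whose teacher label is
        # non-finite, non-positive, or below the confidence threshold.
        if not math.isfinite(t) or t <= 0 or c < conf_thresh:
            continue
        # Clamping (assumed rule): cap the error so that a few badly wrong
        # pseudo labels cannot dominate the average.
        errors.append(min(abs(p - t), clamp_max))

    # With no trusted pixels there is nothing to learn from; return zero loss.
    return sum(errors) / len(errors) if errors else 0.0


# Flattened 2x2 example: one outlier label, one missing label,
# and one low-confidence label.
pred = [10.0, 20.0, 30.0, 40.0]
pseudo = [11.0, 100.0, float("nan"), 40.0]
confidence = [0.9, 0.9, 0.9, 0.1]
loss = clamped_pseudo_label_loss(pred, pseudo, confidence)
print(loss)
```

In this toy example only the first two pixels pass the filter, and the outlier's error of 80 is capped at `clamp_max`, so a single unreliable teacher label contributes no more than any other large error.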