Lite Any Stereo V2: Faster and Stronger Efficient Zero-Shot Stereo Matching

Abstract

Recent advances in stereo matching have achieved remarkable accuracy, but often rely on large models, heavy computation, or additional foundation-model priors, making them difficult to deploy on resource-constrained platforms. In contrast, efficient stereo models offer faster inference but are commonly considered less capable of strong zero-shot generalization. In this paper, we challenge this assumption by introducing Lite Any Stereo V2 (LAS2), an ultra-fast model series designed for efficient zero-shot stereo matching. LAS2 is developed from both architecture and training perspectives. Architecturally, we revisit efficient stereo design under practical deployment settings and propose a 2D-only cost aggregation framework, optimized for real inference latency rather than theoretical MACs alone. For training, we develop a three-stage strategy that combines synthetic supervision, self-distillation, and real-world knowledge distillation. To improve the reliability of real-world pseudo supervision, we further introduce pseudo-label filtering and an error-clamping operation, enabling smoother synthetic-to-real transfer. We instantiate LAS2 as a family of models, including feed-forward variants for different efficiency budgets and an iterative variant for higher accuracy. Extensive experiments show that LAS2 achieves state-of-the-art accuracy among efficient stereo methods while maintaining significantly lower latency. Specifically, LAS2-H achieves stronger overall zero-shot performance than the iterative method Fast-FoundationStereo, with 1.8x and 2.7x faster inference on H200 and Orin, respectively. The project page, demos, and code are available at https://tomtomtommi.github.io/LiteAnyStereoV2/.
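The abstract names pseudo-label filtering and error clamping but does not say how either is computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. It assumes a teacher model supplies a per-pixel pseudo disparity and a confidence score, that filtering drops low-confidence or invalid pixels, and that clamping caps the per-pixel L1 error so that badly wrong pseudo-labels cannot dominate the loss. The function name, the confidence input, and both thresholds (`conf_thresh`, `max_err`) are hypothetical.

```python
import math


def clamped_pseudo_label_loss(pred, pseudo, confidence,
                              conf_thresh=0.5, max_err=5.0):
    """Mean clamped L1 error over pixels that pass a pseudo-label filter.

    All names and default values here are illustrative assumptions;
    the source abstract does not define them.

    pred, pseudo, confidence: flat sequences of per-pixel floats
    (student disparity, teacher pseudo disparity, teacher confidence).
    """
    total = 0.0
    count = 0
    for p, t, c in zip(pred, pseudo, confidence):
        # Filtering (assumed rule): skip pixels whose pseudo disparity is
        # non-finite or non-positive, or whose confidence is too low.
        if not math.isfinite(t) or t <= 0 or c < conf_thresh:
            continue
        # Error clamping (assumed form): cap the per-pixel L1 error.
        total += min(abs(p - t), max_err)
        count += 1
    # Return 0.0 when no pixel survives the filter, to avoid dividing by zero.
    return total / count if count else 0.0
```

For example, with predictions `[10, 20, 30, 40]`, pseudo-labels `[11, 50, 30, 0]`, and confidences `[0.9, 0.9, 0.1, 0.9]`, the third and fourth pixels are filtered out, the second pixel's error of 30 is clamped to 5, and the loss is 3.0 rather than the unclamped 15.5. A real training pipeline would express the same logic as batched tensor operations.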