Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Abstract

Transformer-based 3D reconstruction has emerged as a powerful paradigm for recovering geometry and appearance from multi-view observations, offering strong performance across challenging visual conditions. As these models scale to larger backbones and higher-resolution inputs, improving their efficiency becomes increasingly important for practical deployment. However, modern 3D transformer pipelines face two coupled challenges: dense multi-view attention creates substantial token-mixing overhead, and low-precision execution can destabilize geometry-sensitive representations and degrade depth, pose, and 3D consistency. To address the first challenge, we propose Lite3R, a model-agnostic teacher-student framework that replaces dense attention with Sparse Linear Attention to preserve important geometric interactions while reducing attention cost. To address the second challenge, we introduce a parameter-efficient FP8-aware quantization-aware training (FP8-aware QAT) strategy with partial attention distillation, which freezes the vast majority of pretrained backbone parameters and trains only lightweight linear-branch projection layers, enabling stable low-precision deployment while retaining pretrained geometric priors. We further evaluate Lite3R on two representative backbones, VGGT and DA3-Large, over BlendedMVS and DTU64, showing that it substantially reduces latency (1.7-2.0x) and memory usage (1.9-2.4x) while preserving competitive reconstruction quality overall. These results demonstrate that Lite3R provides an effective algorithm-system co-design approach for practical transformer-based 3D reconstruction. Code: https://github.com/AIGeeksGroup/Lite3R. Website: https://aigeeksgroup.github.io/Lite3R.
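
To make the attention-cost argument concrete, the following is a minimal sketch of generic kernelized linear attention. It is not the Lite3R implementation, and it does not reproduce the Sparse Linear Attention design, whose details the abstract does not specify. The function names and the `elu(x) + 1` feature map are assumptions chosen for illustration. The sketch shows only why reordering the matrix products avoids building an N x N token-mixing matrix.

```python
# Illustrative sketch only: generic kernelized linear attention, not the Lite3R code.
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Return the transpose of a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def feature_map(x):
    """Positive feature map phi(x) = elu(x) + 1 (an assumed, commonly used choice)."""
    return [[v + 1.0 if v > 0 else math.exp(v) for v in row] for row in x]


def attention_quadratic(q, k, v):
    """Normalize phi(Q) phi(K)^T row-wise, then apply it to V: O(N^2 d) work."""
    scores = matmul(feature_map(q), transpose(feature_map(k)))  # N x N matrix
    weights = [[s / sum(row) for s in row] for row in scores]
    return matmul(weights, v)


def attention_linear(q, k, v):
    """Same result computed as phi(Q) (phi(K)^T V): O(N d^2) work, no N x N matrix."""
    fq, fk = feature_map(q), feature_map(k)
    kv = matmul(transpose(fk), v)            # d x d_v summary of keys and values
    k_sum = [sum(col) for col in zip(*fk)]   # d-vector used for normalization
    out = matmul(fq, kv)                     # N x d_v unnormalized outputs
    norms = [sum(a * b for a, b in zip(row, k_sum)) for row in fq]
    return [[o / n for o in row] for row, n in zip(out, norms)]
```

Both functions return the same values up to floating-point error. The second never materializes the N x N matrix, so its cost grows linearly with the number of tokens N, which is what matters when many views contribute tokens.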

