Styl3R: Instant 3D Stylized Reconstruction for Arbitrary Scenes and Styles

Abstract

Stylizing 3D scenes instantly while maintaining multi-view consistency and faithfully resembling a style image remains a significant challenge. Current state-of-the-art 3D stylization methods typically rely on computationally intensive test-time optimization to transfer artistic features into a pretrained 3D representation, and they often require dense, posed input images. In contrast, leveraging recent advances in feed-forward reconstruction models, we demonstrate a novel approach that achieves direct 3D stylization in less than a second from unposed sparse-view scene images and an arbitrary style image. To address the inherent decoupling between reconstruction and stylization, we introduce a branched architecture that separates structure modeling from appearance shading, effectively preventing style transfer from distorting the underlying 3D scene structure. Furthermore, we adapt an identity loss to facilitate pre-training our stylization model through the novel view synthesis task. This strategy also allows our model to retain its original reconstruction capabilities while being fine-tuned for stylization. Comprehensive evaluations on both in-domain and out-of-domain datasets demonstrate that our approach produces high-quality stylized 3D content that achieves a superior blend of style and scene appearance, while also outperforming existing methods in multi-view consistency and efficiency.
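The idea of separating structure modeling from appearance shading can be illustrated with a toy sketch. This is not the Styl3R architecture: the class `BranchedStylizer`, its linear heads, the feature size, and the random inputs are all hypothetical stand-ins chosen for illustration. The only point it shows is that when the structure branch never receives the style code, changing the style cannot move the predicted 3D points.

```python
import numpy as np

rng = np.random.default_rng(0)


class BranchedStylizer:
    """Toy two-branch model (hypothetical): geometry and color use separate heads."""

    def __init__(self, feat_dim=8):
        # Random linear heads standing in for real networks (assumption).
        self.w_structure = rng.normal(size=(feat_dim, 3))  # scene features -> 3D position
        self.w_color = rng.normal(size=(feat_dim, 3))      # scene features -> RGB logits
        self.w_style = rng.normal(size=(feat_dim, 3))      # style code -> RGB logit shift

    def structure(self, scene_feats):
        # Structure branch: depends on scene features only, never on the style.
        return scene_feats @ self.w_structure

    def appearance(self, scene_feats, style_code):
        # Appearance branch: mixes scene features with the style code.
        logits = scene_feats @ self.w_color + style_code @ self.w_style
        return 1.0 / (1.0 + np.exp(-logits))  # sigmoid squashes to valid RGB range

    def forward(self, scene_feats, style_code):
        return self.structure(scene_feats), self.appearance(scene_feats, style_code)


# Usage: the same scene rendered under two different style codes.
model = BranchedStylizer()
scene_feats = rng.normal(size=(5, 8))  # 5 hypothetical per-point feature vectors
style_a = rng.normal(size=8)
style_b = rng.normal(size=8)

pos_a, rgb_a = model.forward(scene_feats, style_a)
pos_b, rgb_b = model.forward(scene_feats, style_b)

geometry_unchanged = np.array_equal(pos_a, pos_b)
colors_changed = not np.allclose(rgb_a, rgb_b)
```

In this sketch the geometry is identical for both styles by construction, while the colors differ. A single-branch design that fed the style code into the position head would not give that guarantee.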