Improving 2D Feature Representations by 3D-Aware Fine-Tuning

Abstract

Current visual foundation models are trained purely on unstructured 2D data, which limits their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance in semantic segmentation and depth estimation through simple linear probing. Notably, although the models are fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets and to out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
