Improving 2D Feature Representations by 3D-Aware Fine-Tuning
July 29, 2024
Authors: Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen
cs.AI
Abstract
Current visual foundation models are trained purely on unstructured 2D data,
limiting their understanding of the 3D structure of objects and scenes. In this
work, we show that fine-tuning on 3D-aware data improves the quality of
emerging semantic features. We design a method to lift semantic 2D features
into an efficient 3D Gaussian representation, which allows us to re-render them
for arbitrary views. Using the rendered 3D-aware features, we design a
fine-tuning strategy to transfer such 3D awareness into a 2D foundation model.
We demonstrate that models fine-tuned in that way produce features that readily
improve downstream task performance in semantic segmentation and depth
estimation through simple linear probing. Notably, though fine-tuned on a
single indoor dataset, the improvement is transferable to a variety of indoor
datasets and out-of-domain datasets. We hope our study encourages the community
to consider injecting 3D awareness when training 2D foundation models. Project
page: https://ywyue.github.io/FiT3D
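
To make the described pipeline concrete, below is a minimal PyTorch sketch of the fine-tuning step: features from a 2D backbone are regressed toward 3D-aware feature maps re-rendered from the lifted representation. The names `Backbone2D` and `render_gaussian_features`, the tensor shapes, and the MSE objective are illustrative assumptions rather than the authors' implementation; in particular, the 3D Gaussian feature rendering is stubbed out with a fixed random tensor.

```python
# Minimal sketch of 3D-aware fine-tuning (assumptions noted above).
import torch
import torch.nn as nn
import torch.nn.functional as F

class Backbone2D(nn.Module):
    """Toy stand-in for a 2D foundation model (e.g. a ViT feature extractor)."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.proj = nn.Conv2d(3, dim, kernel_size=1)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, dim, H, W) dense feature maps
        return self.proj(images)

def render_gaussian_features(num_views: int, dim: int = 64, hw: int = 32) -> torch.Tensor:
    # Placeholder for re-rendering the lifted 3D Gaussian feature field
    # from the given camera views; returns deterministic dummy targets here.
    g = torch.Generator().manual_seed(0)
    return torch.randn(num_views, dim, hw, hw, generator=g)

model = Backbone2D()
opt = torch.optim.Adam(model.parameters(), lr=1e-4)

for step in range(3):  # toy fine-tuning loop
    images = torch.rand(2, 3, 32, 32)      # batch of posed RGB views
    targets = render_gaussian_features(2)  # 3D-aware feature targets
    feats = model(images)
    # Regress 2D features toward the multi-view-consistent renderings,
    # transferring 3D awareness into the 2D backbone.
    loss = F.mse_loss(feats, targets)
    opt.zero_grad()
    loss.backward()
    opt.step()
```

For the downstream evaluation mentioned in the abstract, a linear probe would amount to training a single layer, e.g. `nn.Conv2d(dim, num_classes, 1)` for semantic segmentation or a one-channel head for depth, on top of the frozen fine-tuned features.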