

Improving 2D Feature Representations by 3D-Aware Fine-Tuning

July 29, 2024
作者: Yuanwen Yue, Anurag Das, Francis Engelmann, Siyu Tang, Jan Eric Lenssen
cs.AI

Abstract

Current visual foundation models are trained purely on unstructured 2D data, limiting their understanding of the 3D structure of objects and scenes. In this work, we show that fine-tuning on 3D-aware data improves the quality of emerging semantic features. We design a method to lift semantic 2D features into an efficient 3D Gaussian representation, which allows us to re-render them for arbitrary views. Using the rendered 3D-aware features, we design a fine-tuning strategy to transfer such 3D awareness into a 2D foundation model. We demonstrate that models fine-tuned in this way produce features that readily improve downstream task performance on semantic segmentation and depth estimation through simple linear probing. Notably, though fine-tuned on a single indoor dataset, the improvement transfers to a variety of indoor datasets and out-of-domain datasets. We hope our study encourages the community to consider injecting 3D awareness when training 2D foundation models. Project page: https://ywyue.github.io/FiT3D.
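The abstract evaluates feature quality via "simple linear probing": a single linear layer is fit on top of frozen backbone features for a downstream task such as depth estimation. The sketch below illustrates that evaluation protocol on synthetic per-pixel features; all data and variable names here are hypothetical, and this is not the authors' implementation.

```python
import numpy as np

# Toy illustration of linear probing: fit one linear layer on top of
# frozen per-pixel features, here for a depth-regression probe.
# All data is synthetic, for illustration only.
rng = np.random.default_rng(0)

n_pixels, feat_dim = 2000, 64
features = rng.normal(size=(n_pixels, feat_dim))  # frozen backbone features
true_w = rng.normal(size=feat_dim)
depth = features @ true_w + 0.01 * rng.normal(size=n_pixels)  # target depths

# Closed-form least-squares fit of the probe; the backbone stays frozen,
# so only this linear map is learned.
w, *_ = np.linalg.lstsq(features, depth, rcond=None)

pred = features @ w
rmse = float(np.sqrt(np.mean((pred - depth) ** 2)))
print(f"probe RMSE: {rmse:.4f}")
```

Better features make such a probe more accurate, which is why probing is a common proxy for representation quality: the probe's capacity is too small to compensate for weak features.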

