
Probing the 3D Awareness of Visual Foundation Models

April 12, 2024
作者: Mohamed El Banani, Amit Raj, Kevis-Kokitsi Maninis, Abhishek Kar, Yuanzhen Li, Michael Rubinstein, Deqing Sun, Leonidas Guibas, Justin Johnson, Varun Jampani
cs.AI

Abstract

Recent advances in large-scale pretraining have yielded visual foundation models with strong capabilities. Not only can recent models generalize to arbitrary images for their training task, but their intermediate representations are also useful for other visual tasks such as detection and segmentation. Given that such models can classify, delineate, and localize objects in 2D, we ask: do they also represent their 3D structure? In this work, we analyze the 3D awareness of visual foundation models. We posit that 3D awareness implies that representations (1) encode the 3D structure of the scene and (2) consistently represent the surface across views. We conduct a series of experiments using task-specific probes and zero-shot inference procedures on frozen features. Our experiments reveal several limitations of the current models. Our code and analysis can be found at https://github.com/mbanani/probe3d.
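The probing protocol the abstract describes — training a small task-specific head on top of frozen features — can be sketched as follows. This is a minimal illustration, not the paper's implementation: a fixed random projection stands in for a frozen foundation-model encoder (the paper uses models such as DINO and CLIP), and a closed-form ridge regression stands in for the trained probe; all names and dimensions here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a frozen visual foundation model: a fixed, untrained
# projection from input space to feature space. Its weights are never
# updated during probing.
D_IN, D_FEAT = 64, 32
W_frozen = rng.normal(size=(D_IN, D_FEAT))

def frozen_features(x):
    """Extract features from the frozen encoder (no weight updates)."""
    return np.tanh(x @ W_frozen)

# Synthetic per-sample targets (standing in for a dense 3D property
# such as depth) that are linearly decodable from the features.
N = 500
X = rng.normal(size=(N, D_IN))
F = frozen_features(X)
w_true = rng.normal(size=D_FEAT)
y = F @ w_true + 0.01 * rng.normal(size=N)

# Task-specific probe: only these weights are fit; the encoder stays frozen.
lam = 1e-3
w_probe = np.linalg.solve(F.T @ F + lam * np.eye(D_FEAT), F.T @ y)

pred = F @ w_probe
r2 = 1.0 - np.sum((y - pred) ** 2) / np.sum((y - y.mean()) ** 2)
print(f"probe R^2 on frozen features: {r2:.3f}")
```

The key design point mirrored here is that probe quality measures what the frozen representation already encodes: a high score means the property is linearly recoverable from the features, while a low score is evidence the representation lacks it.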
