Global and Local Entailment Learning for Natural World Imagery

Abstract

Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address it through entailment learning. However, these approaches do not explicitly model the transitivity of entailment, the property that establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. The framework optimizes for the partial order of concepts within vision-language models. Using it, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks show that our models outperform the existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.