Global and Local Entailment Learning for Natural World Imagery

Abstract

Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge with entailment learning. However, these approaches fail to explicitly model the transitive nature of entailment, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our framework optimizes for the partial order of concepts within vision-language models. Using this framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks show that our models outperform existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.
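
To make the notion of transitive entailment over a partial order concrete, the following is a minimal Python sketch on a toy slice of a taxonomy. It is not the RCME method and involves no embeddings. The `PARENT` map and the helpers `ancestors`, `entails`, and `is_transitive` are hypothetical names introduced purely for illustration. It only shows the property the abstract refers to: if a species entails its genus and the genus entails its family, the species should also entail the family.

```python
# Hypothetical toy taxonomy (child -> parent); illustrative only, not from the paper.
PARENT = {
    "Panthera leo": "Panthera",
    "Panthera": "Felidae",
    "Felidae": "Carnivora",
}


def ancestors(concept):
    """Return every concept strictly above `concept`, nearest first."""
    chain = []
    while concept in PARENT:
        concept = PARENT[concept]
        chain.append(concept)
    return chain


def entails(specific, general):
    """True if `specific` entails `general`; a concept entails itself."""
    return specific == general or general in ancestors(specific)


def is_transitive(concepts):
    """Check: a entails b and b entails c always implies a entails c."""
    return all(
        entails(a, c)
        for a in concepts
        for b in concepts
        for c in concepts
        if entails(a, b) and entails(b, c)
    )


CONCEPTS = sorted(set(PARENT) | set(PARENT.values()))
```

Here `entails("Panthera leo", "Carnivora")` holds even though the two concepts are not directly linked, which is the kind of multi-level relation a transitivity-aware objective is meant to preserve in the representation space.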
