Global and Local Entailment Learning for Natural World Imagery

Abstract

Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address it with entailment learning. However, these approaches fail to explicitly model the transitive nature of entailment, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. The framework optimizes for the partial order of concepts within vision-language models. Using it, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Experiments on hierarchical species classification and hierarchical retrieval show that our models outperform existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.
