Global and Local Entailment Learning for Natural World Imagery

Abstract

Learning the hierarchical structure of data in vision-language models is a significant challenge. Previous works have attempted to address this challenge by employing entailment learning. However, these approaches do not explicitly model the transitive nature of entailment, which establishes the relationship between order and semantics within a representation space. In this work, we introduce Radial Cross-Modal Embeddings (RCME), a framework that enables the explicit modeling of transitivity-enforced entailment. Our proposed framework optimizes for the partial order of concepts within vision-language models. By leveraging our framework, we develop a hierarchical vision-language foundation model capable of representing the hierarchy in the Tree of Life. Our experiments on hierarchical species classification and hierarchical retrieval tasks show that our models outperform existing state-of-the-art models. Our code and models are open-sourced at https://vishu26.github.io/RCME/index.html.
