Uncertainty-guided Compositional Alignment with Part-to-Whole Semantic Representativeness in Hyperbolic Vision-Language Models

Abstract

Vision-Language Models (VLMs) achieve remarkable performance, but their Euclidean embeddings are limited in capturing hierarchical relationships such as part-to-whole or parent-child structures, and they often struggle in multi-object compositional scenarios. Hyperbolic VLMs mitigate this issue by better preserving hierarchical structure and by modeling part-whole relations (i.e., a whole scene and its part images) through entailment. However, existing approaches do not model the fact that each part has a different level of semantic representativeness with respect to the whole. We propose UNcertainty-guided Compositional Hyperbolic Alignment (UNCHA) for enhancing hyperbolic VLMs. UNCHA models part-to-whole semantic representativeness with hyperbolic uncertainty: parts that are more representative of the whole scene are assigned lower uncertainty, and less representative parts are assigned higher uncertainty. This representativeness is then incorporated into the contrastive objective through uncertainty-guided weights. Finally, the uncertainty is calibrated with an entailment loss regularized by an entropy-based term. With the proposed losses, UNCHA learns hyperbolic embeddings with more accurate part-whole ordering, capturing the underlying compositional structure of an image and improving the understanding of complex multi-object scenes. UNCHA achieves state-of-the-art performance on zero-shot classification, retrieval, and multi-label classification benchmarks. Our code and models are available at: https://github.com/jeeit17/UNCHA.git.
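The idea of folding per-part uncertainty into a contrastive objective can be illustrated with a minimal sketch. It is not the UNCHA implementation: the abstract does not specify how uncertainty is derived from the hyperbolic embedding or how the weights are formed. The sketch therefore assumes that embeddings live on a Lorentz hyperboloid of curvature -c (stored as space components only), that each part comes with a given scalar uncertainty, and that weights are a softmax over negative uncertainty. The function names `lorentz_distance`, `uncertainty_weights`, and `weighted_contrastive_loss` are hypothetical, and the entailment and entropy terms are omitted.

```python
import math

def lorentz_distance(x, y, c=1.0):
    """Geodesic distance on the hyperboloid of curvature -c.

    x and y hold space components only; the time component is
    recovered from the constraint <x, x>_L = -1/c.
    """
    x_t = math.sqrt(1.0 / c + sum(a * a for a in x))
    y_t = math.sqrt(1.0 / c + sum(b * b for b in y))
    inner = sum(a * b for a, b in zip(x, y)) - x_t * y_t  # Lorentzian inner product
    # Clamp to 1 to guard against rounding before acosh.
    return math.acosh(max(1.0, -c * inner)) / math.sqrt(c)

def uncertainty_weights(uncertainties):
    """Assumed weighting: softmax of negative uncertainty.

    Lower uncertainty (a more representative part) gets a larger weight;
    the weights sum to 1.
    """
    exps = [math.exp(-u) for u in uncertainties]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_contrastive_loss(parts, wholes, uncertainties, temperature=0.1, c=1.0):
    """Uncertainty-weighted InfoNCE over part -> whole pairs.

    parts[i] is the positive match of wholes[i]; all other wholes
    act as negatives. Similarity is the negative hyperbolic distance.
    """
    weights = uncertainty_weights(uncertainties)
    loss = 0.0
    for i, part in enumerate(parts):
        logits = [-lorentz_distance(part, w, c) / temperature for w in wholes]
        m = max(logits)  # subtract the max for a stable log-sum-exp
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        loss += weights[i] * (log_denominator - logits[i])
    return loss

# Toy usage: two part embeddings, two whole-scene embeddings.
parts = [[0.1, 0.0], [0.0, 0.9]]
wholes = [[0.2, 0.0], [0.0, 1.0]]
loss = weighted_contrastive_loss(parts, wholes, uncertainties=[0.1, 0.8])
```

In this toy setup the first part has the lower uncertainty, so its contrastive term contributes more to the total loss than the second part's term does.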
