Dense Retrievers Can Fail on Simple Queries: Revealing the Granularity Dilemma of Embeddings

Abstract

This work focuses on an observed limitation of text encoders: embeddings may fail to recognize fine-grained entities or events within the semantics of a passage, causing dense retrieval to fail even on simple cases. To examine this behavior, we first introduce a new evaluation dataset in Chinese, named CapRetrieval, whose passages are image captions and whose queries are phrases asking about entities or events in various forms. Zero-shot evaluation suggests that encoders may fail at such fine-grained matching, regardless of training sources or model sizes. Aiming for improvement, we then finetune encoders with our proposed data generation strategies, which yields the best performance on CapRetrieval. Within this process, we further identify an issue we call the granularity dilemma: a challenge for embeddings to express fine-grained salience while remaining aligned with overall semantics. Our dataset, code and models in this work are publicly released at https://github.com/lxucs/CapRetrieval.
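
As a rough illustration of the dense retrieval setup referred to above, the sketch below ranks passages by cosine similarity between a query embedding and passage embeddings. It is a minimal sketch under stated assumptions, not the authors' evaluation code: the function names `cosine` and `rank_passages` are hypothetical, and the 3-dimensional vectors are hand-written toy values standing in for the output of a real text encoder.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_passages(query_vec, passage_vecs):
    """Return passage indices sorted by descending similarity to the query."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(range(len(passage_vecs)), key=lambda i: scores[i], reverse=True)


# Toy embeddings (assumption: illustrative values, not produced by any model).
passages = [
    [0.9, 0.1, 0.0],  # caption 0
    [0.1, 0.9, 0.2],  # caption 1
    [0.2, 0.2, 0.9],  # caption 2
]
query = [0.0, 0.3, 1.0]  # a short phrase query, embedded in the same space

ranking = rank_passages(query, passages)
print(ranking)
```

In this setup, a retrieval failure of the kind described in the abstract would correspond to the caption that actually mentions the queried entity or event being ranked below captions that match only in overall semantics.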