On the Theoretical Limitations of Embedding-Based Retrieval

Abstract

Vector embeddings have been applied to an ever-growing set of retrieval tasks over the years, and more recently to reasoning, instruction-following, coding, and more. These newer benchmarks push embeddings to work for any query and any notion of relevance that could be given. Prior work has pointed out theoretical limitations of vector embeddings, but a common assumption holds that these difficulties arise only from unrealistic queries, and that the remaining ones can be overcome with better training data and larger models. In this work, we demonstrate that these theoretical limitations can be encountered in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents that can be returned as the result of some query is limited by the dimension of the embedding. We show empirically that this holds even when we restrict to k=2 and optimize directly on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress-tests models based on these theoretical results, and observe that even state-of-the-art models fail on it despite the simple nature of the task. Our work shows the limits of embedding models under the existing single-vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
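
The link between embedding dimension and the number of reachable top-k subsets can be illustrated with a small sketch. This is not the paper's experiment: it assumes dot-product scoring, uses random embeddings rather than optimized ones, and estimates reachability by sampling random queries, so for dimensions above 1 the count is only a lower bound on what is reachable. The function name `reachable_top_k` and all sizes are hypothetical choices made for the illustration.

```python
from itertools import combinations

import numpy as np


def reachable_top_k(docs, queries, k=2):
    """Return the distinct top-k document index sets hit by at least one query.

    docs:    (num_docs, dim) array of document embeddings
    queries: (num_queries, dim) array of query embeddings
    Scoring is the plain dot product (an assumption of this sketch).
    """
    scores = queries @ docs.T                      # (num_queries, num_docs)
    top = np.argsort(-scores, axis=1)[:, :k]       # indices of the k best docs per query
    return {frozenset(row.tolist()) for row in top}


rng = np.random.default_rng(0)
n_docs, k = 6, 2
total = len(list(combinations(range(n_docs), k)))  # C(6, 2) = 15 possible pairs

for dim in (1, 2, 4):
    docs = rng.normal(size=(n_docs, dim))          # random, not trained, embeddings
    queries = rng.normal(size=(20000, dim))        # sampled queries: a lower-bound probe
    hit = reachable_top_k(docs, queries, k)
    print(f"dim={dim}: {len(hit)} of {total} top-{k} subsets observed")
```

In one dimension the outcome is forced: a positive query ranks documents by their scalar value and a negative query reverses that order, so only two of the 15 pairs can ever appear as a top-2 result, no matter how the documents are placed. Higher dimensions make more pairs reachable, which is the dependence on dimension that the abstract describes.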
