On the Theoretical Limitations of Embedding-Based Retrieval

Abstract

Vector embeddings have been applied to an ever-growing set of retrieval tasks over the years, and they are increasingly used for reasoning, instruction following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior work has pointed out theoretical limitations of vector embeddings, a common assumption holds that these difficulties arise only from unrealistic queries, and that the remaining ones can be overcome with better training data and larger models. In this work, we demonstrate that these theoretical limitations can arise in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents that can be returned as the result of some query is limited by the dimension of the embedding. We show empirically that this holds even when we restrict to k=2 and directly optimize free parameterized embeddings on the test set. We then create a realistic dataset called LIMIT that stress-tests models based on these theoretical results, and observe that even state-of-the-art models fail on it despite the simplicity of the task. Our work shows the limits of embedding models under the existing single-vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
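
To make the dimension claim concrete, here is a minimal sketch that counts how many distinct top-k document sets a fixed set of document embeddings can return under dot-product scoring. It is not the optimization procedure described in the abstract: it samples random query vectors, so the count is only a lower bound on what is reachable for those particular embeddings. The function names (`top_k_set`, `reachable_top_k_sets`) and all parameter values are assumptions chosen for illustration.

```python
import math
import random


def top_k_set(query, docs, k):
    """Return the indices of the k documents with the highest dot product."""
    scores = [sum(q * x for q, x in zip(query, doc)) for doc in docs]
    ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return frozenset(ranked[:k])


def reachable_top_k_sets(docs, k, num_queries=5000, seed=0):
    """Collect the distinct top-k sets hit by randomly sampled queries.

    Random sampling can miss sets, so this is a lower bound on the
    sets reachable for the given document embeddings.
    """
    rng = random.Random(seed)
    dim = len(docs[0])
    found = set()
    for _ in range(num_queries):
        query = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        found.add(top_k_set(query, docs, k))
    return found


if __name__ == "__main__":
    n, k = 6, 2  # hypothetical sizes, small enough to enumerate by eye
    rng = random.Random(1)
    for dim in (1, 2, 3, 4):
        docs = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
        found = reachable_top_k_sets(docs, k)
        print(f"d={dim}: {len(found)} of {math.comb(n, k)} top-{k} sets reached")
```

In one dimension the effect is easy to see by hand: a positive query returns the two largest document values and a negative query returns the two smallest, so at most two of the possible pairs can ever be retrieved, however the documents are placed.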
