On the Theoretical Limitations of Embedding-Based Retrieval
August 28, 2025
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
cs.AI
Abstract
Vector embeddings have been tasked with an ever-increasing set of retrieval
tasks over the years, with a nascent rise in using them for reasoning,
instruction-following, coding, and more. These new benchmarks push embeddings
to work for any query and any notion of relevance that could be given. While
prior works have pointed out theoretical limitations of vector embeddings,
there is a common assumption that these difficulties are exclusively due to
unrealistic queries, and those that are not can be overcome with better
training data and larger models. In this work, we demonstrate that we may
encounter these theoretical limitations in realistic settings with extremely
simple queries. We connect known results in learning theory, showing that the
number of top-k subsets of documents capable of being returned as the result of
some query is limited by the dimension of the embedding. We empirically show
that this holds true even if we restrict to k=2, and directly optimize on the
test set with free parameterized embeddings. We then create a realistic dataset
called LIMIT that stress tests models based on these theoretical results, and
observe that even state-of-the-art models fail on this dataset despite the
simple nature of the task. Our work shows the limits of embedding models under
the existing single vector paradigm and calls for future research to develop
methods that can resolve this fundamental limitation.
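To make the empirical claim concrete, the following is a minimal sketch of the free-embedding experiment described above: query and document vectors are treated as free parameters and optimized directly against a binary relevance (qrel) matrix, after which we check whether every query's true top-k set is recovered under dot-product scoring. The use of PyTorch, the Adam optimizer, the margin loss, and the toy problem sizes below are illustrative assumptions, not the paper's exact setup.

```python
# Sketch of a "free embedding" best-case test: optimize the embeddings themselves
# (no encoder, no text) against a binary relevance matrix and check whether a
# d-dimensional dot-product model can realize every required top-k set.
import itertools
import torch

def solvable_with_dim(qrels: torch.Tensor, d: int, steps: int = 2000, lr: float = 0.1) -> bool:
    """qrels: (n_queries, n_docs) binary matrix; each row marks that query's k relevant docs."""
    n_q, n_d = qrels.shape
    Q = torch.randn(n_q, d, requires_grad=True)   # free query embeddings
    D = torch.randn(n_d, d, requires_grad=True)   # free document embeddings
    opt = torch.optim.Adam([Q, D], lr=lr)
    rel = qrels.bool()
    for _ in range(steps):
        scores = Q @ D.T                                                  # dot-product scores
        pos = scores.masked_fill(~rel, float("inf")).min(dim=1).values    # worst relevant score
        neg = scores.masked_fill(rel, float("-inf")).max(dim=1).values    # best irrelevant score
        loss = torch.relu(1.0 - (pos - neg)).mean()                       # margin: relevant > irrelevant
        opt.zero_grad()
        loss.backward()
        opt.step()
    with torch.no_grad():
        scores = Q @ D.T
        pos = scores.masked_fill(~rel, float("inf")).min(dim=1).values
        neg = scores.masked_fill(rel, float("-inf")).max(dim=1).values
        return bool((pos > neg).all())            # True iff every top-k set is exactly recovered

# Example: k = 2, with every pair of n documents required as some query's top-2 result.
n_docs, k = 8, 2
pairs = list(itertools.combinations(range(n_docs), k))
qrels = torch.zeros(len(pairs), n_docs)
for i, pair in enumerate(pairs):
    qrels[i, list(pair)] = 1.0
for d in (2, 3, 4, n_docs):
    print(f"d={d}: all top-{k} sets representable -> {solvable_with_dim(qrels, d)}")
```

Because the embeddings are optimized directly on the test matrix, any failure here is a capacity limit of the embedding dimension itself rather than of the training data or the encoder, which is the point the abstract makes.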