On the Theoretical Limitations of Embedding-Based Retrieval

August 28, 2025
Authors: Orion Weller, Michael Boratko, Iftekhar Naim, Jinhyuk Lee
cs.AI

Abstract

Vector embeddings have been tasked with an ever-increasing set of retrieval tasks over the years, with a nascent rise in using them for reasoning, instruction-following, coding, and more. These new benchmarks push embeddings to work for any query and any notion of relevance that could be given. While prior works have pointed out theoretical limitations of vector embeddings, there is a common assumption that these difficulties are exclusively due to unrealistic queries, and those that are not can be overcome with better training data and larger models. In this work, we demonstrate that we may encounter these theoretical limitations in realistic settings with extremely simple queries. We connect known results in learning theory, showing that the number of top-k subsets of documents capable of being returned as the result of some query is limited by the dimension of the embedding. We empirically show that this holds true even if we restrict to k=2, and directly optimize on the test set with free parameterized embeddings. We then create a realistic dataset called LIMIT that stress tests models based on these theoretical results, and observe that even state-of-the-art models fail on this dataset despite the simple nature of the task. Our work shows the limits of embedding models under the existing single vector paradigm and calls for future research to develop methods that can resolve this fundamental limitation.
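
The claim that embedding dimension caps the number of returnable top-k subsets can be illustrated with a toy sketch. The code below is not the paper's experiment and does not use the LIMIT dataset; the helper `reachable_top_k`, the document layouts, and the query sweep are all assumptions made for illustration. It scores documents by dot product and counts which top-2 subsets ever appear. In one dimension only the sign of the query matters, so only two subsets are reachable. In two dimensions, with six documents placed on a regular hexagon, the sampled queries reach only the adjacent pairs, far fewer than all 15 possible pairs.

```python
import math


def reachable_top_k(docs, queries, k=2):
    """Return the distinct top-k document index sets hit by any of the queries.

    Hypothetical helper: relevance is the plain dot product between a query
    vector and a document vector.
    """
    seen = set()
    for q in queries:
        scores = [sum(qi * di for qi, di in zip(q, d)) for d in docs]
        ranked = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
        seen.add(frozenset(ranked[:k]))
    return seen


n, k = 6, 2
total_subsets = math.comb(n, k)  # all possible top-2 subsets of 6 documents

# 1-D embeddings: documents at 1.0, 2.0, ..., 6.0.
# A 1-D query only has a sign, so two queries cover every case.
docs_1d = [(float(i + 1),) for i in range(n)]
queries_1d = [(1.0,), (-1.0,)]
subsets_1d = reachable_top_k(docs_1d, queries_1d, k)

# 2-D embeddings: documents on a regular hexagon on the unit circle.
docs_2d = [
    (math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
    for i in range(n)
]
# Sweep query directions in 1-degree steps; the small offset avoids exact ties.
angles = [2 * math.pi * j / 360 + 0.001 for j in range(360)]
queries_2d = [(math.cos(t), math.sin(t)) for t in angles]
subsets_2d = reachable_top_k(docs_2d, queries_2d, k)

print("possible top-2 subsets:", total_subsets)
print("reached with d=1:", len(subsets_1d))
print("reached with d=2:", len(subsets_2d))
```

Sampling queries can only show which subsets are reached, not prove that others are unreachable in general; for this symmetric layout, though, a pair of opposite hexagon vertices such as documents 0 and 3 can never be the top 2, because any query direction scores one of them below its neighbors.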