Reveal Hidden Pitfalls and Navigate Next Generation of Vector Similarity Search from Task-Centric Views
December 15, 2025
Authors: Tingyang Chen, Cong Fu, Jiahua Wu, Haotian Wu, Hua Fan, Xiangyu Ke, Yunjun Gao, Yabo Ni, Anxiang Zeng
cs.AI
Abstract
Vector Similarity Search (VSS) in high-dimensional spaces is rapidly emerging as core functionality in next-generation database systems for numerous data-intensive services -- from embedding lookups in large language models (LLMs), to semantic information retrieval and recommendation engines. Current benchmarks, however, evaluate VSS primarily on the recall-latency trade-off against a ground truth defined solely by distance metrics, neglecting how retrieval quality ultimately impacts downstream tasks. This disconnect can mislead both academic research and industrial practice.
We present Iceberg, a holistic benchmark suite for end-to-end evaluation of VSS methods in realistic application contexts. From a task-centric view, Iceberg uncovers the Information Loss Funnel, which identifies three principal sources of end-to-end performance degradation: (1) Embedding Loss during feature extraction; (2) Metric Misuse, where distances poorly reflect task relevance; and (3) Data Distribution Sensitivity, highlighting index robustness across skews and modalities. For a more comprehensive assessment, Iceberg spans eight diverse datasets across key domains such as image classification, face recognition, text retrieval, and recommendation systems. Each dataset, ranging from 1M to 100M vectors, includes rich, task-specific labels and evaluation metrics, enabling assessment of retrieval algorithms within the full application pipeline rather than in isolation.

Iceberg benchmarks 13 state-of-the-art VSS methods and re-ranks them based on application-level metrics, revealing substantial deviations from traditional rankings derived purely from recall-latency evaluations. Building on these insights, we define a set of task-centric meta-features and derive an interpretable decision tree to guide practitioners in selecting and tuning VSS methods for their specific workloads.
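The gap between recall-latency evaluation and task-level evaluation can be made concrete with a minimal sketch (not from the paper; all names and data here are hypothetical): an ANN index may miss some distance-defined nearest neighbors, lowering recall@k, while the items it does return serve the downstream task (here, agreement with the query's class label) just as well or better.

```python
# Illustrative sketch: recall@k against a distance-defined ground truth
# can diverge from a task-level metric such as label agreement.

def recall_at_k(retrieved, ground_truth):
    """Fraction of distance-based true neighbors that were retrieved."""
    return len(set(retrieved) & set(ground_truth)) / len(ground_truth)

def label_match_rate(retrieved, labels, query_label):
    """Task-level metric: share of retrieved items sharing the query's label."""
    return sum(labels[i] == query_label for i in retrieved) / len(retrieved)

# Hypothetical data: items 0-4 are the 5 nearest by distance,
# but two of them (1 and 4) belong to a different class than the query.
labels = {0: "cat", 1: "dog", 2: "cat", 3: "cat", 4: "dog", 5: "cat", 6: "cat"}
ground_truth = [0, 1, 2, 3, 4]   # top-5 neighbors by raw distance
retrieved = [0, 2, 3, 5, 6]      # what an ANN index returned

print(recall_at_k(retrieved, ground_truth))        # 0.6 (misses items 1 and 4)
print(label_match_rate(retrieved, labels, "cat"))  # 1.0 (every hit matches the query's class)
```

Under a purely distance-based benchmark the index above scores only 0.6 recall, yet every item it returned is task-relevant; this is the kind of ranking inversion the abstract describes.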