Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
May 23, 2025
Authors: Hansa Meghwani, Amit Agarwal, Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Srikant Panda
cs.AI
Abstract
Enterprise search systems often struggle to retrieve accurate,
domain-specific information due to semantic mismatches and overlapping
terminologies. These issues can degrade the performance of downstream
applications such as knowledge management, customer support, and
retrieval-augmented generation agents. To address this challenge, we propose a
scalable hard-negative mining framework tailored specifically for
domain-specific enterprise data. Our approach dynamically selects semantically
challenging but contextually irrelevant documents to enhance deployed
re-ranking models.
Our method integrates diverse embedding models, performs dimensionality
reduction, and uniquely selects hard negatives, ensuring computational
efficiency and semantic precision. Evaluation on our proprietary enterprise
corpus (cloud services domain) demonstrates substantial improvements of 15% in
MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other
negative sampling techniques. Further validation on public domain-specific
datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability
and readiness for real-world applications.
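
The headline results are reported as MRR@3 and MRR@10. For readers unfamiliar with the metric, the following is a minimal sketch of the standard mean reciprocal rank at cutoff k. It is not taken from the paper: the function name `mrr_at_k`, the document IDs, and the toy data are illustrative assumptions.

```python
from typing import Sequence, Set


def mrr_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    k: int,
) -> float:
    """Mean reciprocal rank over queries, looking only at the top k results.

    rankings[i] is the ranked list of document IDs returned for query i.
    relevant[i] is the set of document IDs judged relevant for query i.
    A query with no relevant document in its top k contributes 0.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        # Ranks are 1-based; only the first relevant hit counts.
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data: two queries with one relevant document each.
rankings = [["d1", "d2", "d3"], ["d4", "d5", "d6", "d7"]]
relevant = [{"d2"}, {"d7"}]

mrr3 = mrr_at_k(rankings, relevant, k=3)
mrr10 = mrr_at_k(rankings, relevant, k=10)
```

In this toy case the second query's relevant document sits at rank 4, so it counts toward MRR@10 but not MRR@3. This is why a re-ranker that pushes relevant documents upward can move the two metrics by different amounts.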