Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Abstract

Enterprise search systems often struggle to retrieve accurate, domain-specific information because of semantic mismatches and overlapping terminology. These issues can degrade downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored to domain-specific enterprise data. Our approach dynamically selects documents that are semantically challenging yet contextually irrelevant, and uses them to improve deployed re-ranking models. The method integrates diverse embedding models, applies dimensionality reduction, and then selects hard negatives in a way designed to preserve both computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) shows improvements of 15% in MRR@3 and 19% in MRR@10 over state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms the method's generalizability and readiness for real-world applications.
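To make the three stages named in the abstract concrete (fusing several embedding models, reducing dimensionality, selecting hard negatives), the following is a minimal illustrative sketch and not the paper's implementation. It assumes that fusion means concatenating L2-normalized vectors, that the reduction step is PCA, and that a hard negative is simply a non-positive document with high cosine similarity to the query in the reduced space. The abstract does not specify any of these choices, and the function names `fuse_and_reduce` and `mine_hard_negatives` are hypothetical.

```python
import numpy as np


def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.clip(norms, 1e-12, None)


def fuse_and_reduce(embeddings: list[np.ndarray], n_components: int) -> np.ndarray:
    """Concatenate per-model embeddings, then project onto the top principal components.

    Assumption: every matrix has one row per text (queries and documents alike),
    in the same row order.
    """
    fused = np.hstack([l2_normalize(e) for e in embeddings])
    centered = fused - fused.mean(axis=0, keepdims=True)
    # PCA via SVD: rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def mine_hard_negatives(reduced, query_row, positive_row, candidate_rows, k):
    """Return the k candidates most similar to the query, excluding the known positive."""
    unit = l2_normalize(reduced)
    sims = unit[candidate_rows] @ unit[query_row]
    order = np.argsort(-sims)  # most similar first
    ranked = [candidate_rows[i] for i in order if candidate_rows[i] != positive_row]
    return ranked[:k]


# Toy data standing in for two embedding models (random, for illustration only).
rng = np.random.default_rng(0)
model_a = rng.normal(size=(6, 8))  # hypothetical model A: 6 texts, 8 dimensions
model_b = rng.normal(size=(6, 5))  # hypothetical model B: 6 texts, 5 dimensions

reduced = fuse_and_reduce([model_a, model_b], n_components=3)
# Row 0 is the query, row 1 its labeled positive, rows 1-5 the document pool.
negatives = mine_hard_negatives(
    reduced, query_row=0, positive_row=1, candidate_rows=[1, 2, 3, 4, 5], k=2
)
```

In a real setting the random matrices would be replaced by the outputs of actual embedding models, and the mined negatives would be paired with each query and its positive when fine-tuning the re-ranker. A pure top-k rule like this one can pick up unlabeled relevant documents as false negatives, which is one reason a production selection criterion is likely to be stricter than this sketch.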