Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Abstract

Enterprise search systems often struggle to retrieve accurate, domain-specific information because of semantic mismatches and overlapping terminology. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored to domain-specific enterprise data. Our approach dynamically selects documents that are semantically challenging but contextually irrelevant and uses them to enhance deployed re-ranking models. The method integrates diverse embedding models, performs dimensionality reduction, and applies a novel hard-negative selection step, with the aim of ensuring both computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) demonstrates improvements of 15% in MRR@3 and 19% in MRR@10 over state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability and readiness for real-world applications.
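
For reference, MRR@k is the mean, over all queries, of the reciprocal rank of the first relevant document, counting only the top k results; a query with no relevant document in the top k contributes zero. The following is a minimal Python sketch of that standard definition. The function name `mrr_at_k` and the toy rankings are illustrative assumptions and are not taken from the evaluation code behind the reported numbers.

```python
from typing import Hashable, Sequence, Set


def mrr_at_k(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
    k: int,
) -> float:
    """Mean reciprocal rank truncated at rank k.

    rankings[i] is the ranked list of document ids returned for query i.
    relevant[i] is the set of document ids judged relevant for query i.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        # Only the first relevant hit inside the top k counts.
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break
    return total / len(rankings)


# Hypothetical toy data: three queries with one relevant document each.
toy_rankings = [
    ["d3", "d1", "d7"],        # relevant document at rank 2
    ["d2", "d9", "d4", "d5"],  # relevant document at rank 4
    ["d8", "d6", "d0"],        # relevant document at rank 1
]
toy_relevant = [{"d1"}, {"d5"}, {"d8"}]

print(mrr_at_k(toy_rankings, toy_relevant, k=3))
print(mrr_at_k(toy_rankings, toy_relevant, k=10))
```

In this toy case the second query's relevant document sits at rank 4, so it contributes nothing at k=3 but 1/4 at k=10. This is why MRR@10 can never be lower than MRR@3 for the same set of rankings.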