Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems
May 23, 2025
Authors: Hansa Meghwani, Amit Agarwal, Priyaranjan Pattnayak, Hitesh Laxmichand Patel, Srikant Panda
cs.AI
Abstract
Enterprise search systems often struggle to retrieve accurate,
domain-specific information due to semantic mismatches and overlapping
terminologies. These issues can degrade the performance of downstream
applications such as knowledge management, customer support, and
retrieval-augmented generation agents. To address this challenge, we propose a
scalable hard-negative mining framework tailored specifically for
domain-specific enterprise data. Our approach dynamically selects semantically
challenging but contextually irrelevant documents to enhance deployed
re-ranking models.
Our method integrates diverse embedding models, performs dimensionality
reduction, and uniquely selects hard negatives, ensuring computational
efficiency and semantic precision. Evaluation on our proprietary enterprise
corpus (cloud services domain) demonstrates substantial improvements of 15% in
MRR@3 and 19% in MRR@10 compared to state-of-the-art baselines and other
negative sampling techniques. Further validation on public domain-specific
datasets (FiQA, Climate Fever, TechQA) confirms our method's generalizability
and readiness for real-world applications.
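
To make the two central notions concrete, the following is a minimal sketch of generic similarity-based hard-negative selection and of the MRR@k metric. It is not the authors' implementation: the abstract does not state the selection criterion, the embedding models, or the dimensionality-reduction step, so the sketch simply ranks non-relevant documents by cosine similarity to the query. The function names `cosine`, `select_hard_negatives`, and `mrr_at_k`, and the toy vectors, are hypothetical and chosen only for illustration.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_hard_negatives(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    relevant_ids: Set[str],
    n: int = 2,
) -> List[str]:
    """Return the n non-relevant document ids most similar to the query.

    Assumption: plain cosine ranking stands in for the paper's own
    selection rule, which the abstract does not describe.
    """
    scored = [
        (cosine(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id not in relevant_ids
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:n]]


def mrr_at_k(
    rankings: List[List[str]], relevant: List[Set[str]], k: int
) -> float:
    """Mean reciprocal rank, counting only hits within the top k results."""
    total = 0.0
    for ranked_ids, relevant_ids in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant_ids:
                total += 1.0 / rank
                break
    return total / len(rankings) if rankings else 0.0


# Toy data (hypothetical): d1 is the relevant document for the query.
docs = {"d1": [1.0, 0.0], "d2": [0.9, 0.1], "d3": [0.0, 1.0], "d4": [0.7, 0.7]}
query = [1.0, 0.05]
hard_negatives = select_hard_negatives(query, docs, {"d1"}, n=2)

# Two queries whose first relevant hit is at rank 2 and rank 3 respectively.
rankings = [["d2", "d1", "d3"], ["d3", "d4", "d1"]]
relevant = [{"d1"}, {"d1"}]
score_at_3 = mrr_at_k(rankings, relevant, k=3)
```

In this toy setup the selector picks `d2` and `d4`, the documents closest to the query that are not marked relevant, and MRR@3 is the mean of 1/2 and 1/3. A query with no relevant hit inside the top k contributes zero, which is why MRR@3 and MRR@10 can differ for the same ranker.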