Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Abstract

Enterprise search systems often struggle to retrieve accurate, domain-specific information because of semantic mismatches and overlapping terminology. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored to domain-specific enterprise data. Our approach dynamically selects semantically challenging but contextually irrelevant documents to improve deployed re-ranking models. The method integrates diverse embedding models, performs dimensionality reduction, and uniquely selects hard negatives, ensuring computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) shows improvements of 15% in MRR@3 and 19% in MRR@10 over state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate-FEVER, TechQA) confirms the method's generalizability and readiness for real-world applications.
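
The reported metrics, MRR@3 and MRR@10, follow the standard definition of mean reciprocal rank truncated at a cutoff k: for each query, take the reciprocal of the rank of the first relevant document if it appears in the top k results, and 0 otherwise, then average over queries. The sketch below illustrates that standard definition only. It is not the authors' evaluation code, and the function names, document IDs, and toy rankings are hypothetical.

```python
from typing import List, Sequence, Set


def reciprocal_rank_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Return 1/rank of the first relevant document within the top k, else 0.0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mrr_at_k(rankings: List[Sequence[str]], relevants: List[Set[str]], k: int) -> float:
    """Average the truncated reciprocal rank over all queries."""
    if len(rankings) != len(relevants):
        raise ValueError("rankings and relevants must have the same length")
    if not rankings:
        return 0.0
    total = sum(reciprocal_rank_at_k(r, rel, k) for r, rel in zip(rankings, relevants))
    return total / len(rankings)


# Hypothetical toy data: one ranked result list per query.
rankings = [
    ["d3", "d1", "d7", "d2"],  # first relevant document at rank 2
    ["d5", "d9", "d4", "d8"],  # first relevant document at rank 4
    ["d6", "d2", "d1", "d0"],  # first relevant document at rank 1
]
relevants = [{"d1"}, {"d8"}, {"d6"}]

mrr3 = mrr_at_k(rankings, relevants, k=3)    # (1/2 + 0 + 1) / 3
mrr10 = mrr_at_k(rankings, relevants, k=10)  # (1/2 + 1/4 + 1) / 3
print(f"MRR@3 = {mrr3:.4f}, MRR@10 = {mrr10:.4f}")
```

In this toy example the second query contributes nothing to MRR@3 because its first relevant document sits at rank 4, which is why MRR@10 can exceed MRR@3 for the same rankings. The abstract does not state whether the 15% and 19% gains are relative or absolute differences in these scores.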