Hard Negative Mining for Domain-Specific Retrieval in Enterprise Systems

Abstract

Enterprise search systems often struggle to retrieve accurate, domain-specific information because of semantic mismatches and overlapping terminology. These issues can degrade the performance of downstream applications such as knowledge management, customer support, and retrieval-augmented generation agents. To address this challenge, we propose a scalable hard-negative mining framework tailored to domain-specific enterprise data. Our approach dynamically selects documents that are semantically challenging but contextually irrelevant, and uses them to strengthen deployed re-ranking models. The method integrates diverse embedding models, performs dimensionality reduction, and applies its own hard-negative selection procedure, maintaining both computational efficiency and semantic precision. Evaluation on our proprietary enterprise corpus (cloud services domain) shows improvements of 15% in MRR@3 and 19% in MRR@10 over state-of-the-art baselines and other negative sampling techniques. Further validation on public domain-specific datasets (FiQA, Climate Fever, TechQA) confirms the method's generalizability and readiness for real-world applications.
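To make the two central notions above concrete, the following is a minimal sketch, not the framework's actual procedure. It assumes a hard negative is simply a non-relevant document whose embedding is close to the query under cosine similarity, and it uses the standard definition of MRR@k (the mean, over queries, of the reciprocal rank of the first relevant document within the top k, or 0 if none appears). The function names `cosine`, `select_hard_negatives`, and `mrr_at_k` are hypothetical, and the sketch omits the multi-model embedding fusion and dimensionality reduction mentioned in the abstract, whose details are not given here.

```python
import math
from typing import Dict, List, Sequence, Set


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def select_hard_negatives(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    relevant: Set[str],
    n: int,
) -> List[str]:
    """Return the n non-relevant document ids most similar to the query.

    Assumption: "hard" means high cosine similarity to the query while
    not being labeled relevant. The paper's own criterion may differ.
    """
    candidates = [doc_id for doc_id in doc_vecs if doc_id not in relevant]
    candidates.sort(key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return candidates[:n]


def mrr_at_k(rankings: List[List[str]], relevant: List[Set[str]], k: int) -> float:
    """Mean reciprocal rank of the first relevant document in the top k."""
    total = 0.0
    for ranked, rel in zip(rankings, relevant):
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings) if rankings else 0.0


if __name__ == "__main__":
    # Toy 2-D embeddings, invented purely for illustration.
    query = [1.0, 0.0]
    docs = {
        "pos": [1.0, 0.1],
        "near": [0.9, 0.2],
        "mid": [0.5, 0.5],
        "far": [0.0, 1.0],
    }
    print(select_hard_negatives(query, docs, {"pos"}, n=2))
    print(mrr_at_k([["a", "b", "c"], ["x", "y", "z"]], [{"b"}, {"z"}], k=3))
```

In this toy setup, the selected negatives would be fed to a re-ranker as training contrast for the positive document, and MRR@k would be computed on held-out queries before and after fine-tuning to measure the effect.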