AutoMIR: Effective Zero-Shot Medical Information Retrieval without Relevance Labels

Abstract

Medical information retrieval (MIR) is essential for retrieving relevant medical knowledge from diverse sources, including electronic health records, scientific literature, and medical databases. However, effective zero-shot dense retrieval in the medical domain remains difficult because relevance-labeled data is scarce. In this paper, we introduce Self-Learning Hypothetical Document Embeddings (SL-HyDE) to address this problem. SL-HyDE uses a large language model (LLM) as a generator that produces hypothetical documents for a given query. These generated documents capture key medical context and guide a dense retriever toward the most relevant documents. The self-learning framework progressively refines both pseudo-document generation and retrieval, using unlabeled medical corpora and requiring no relevance-labeled data. We also present the Chinese Medical Information Retrieval Benchmark (CMIRB), a comprehensive evaluation framework grounded in real-world medical scenarios that comprises five tasks and ten datasets. By benchmarking ten models on CMIRB, we establish a rigorous standard for evaluating medical information retrieval systems. Experimental results demonstrate that SL-HyDE significantly surpasses existing methods in retrieval accuracy and generalizes and scales well across various LLM and retriever configurations. CMIRB data and evaluation code are publicly available at: https://github.com/CMIRB-benchmark/CMIRB.
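The retrieval step described above, in which a generator writes a hypothetical document and a retriever searches with that document instead of the raw query, can be illustrated with a minimal sketch. This is not the authors' implementation: `embed` is a toy bag-of-words stand-in for a dense encoder, `stub_generator` is a hypothetical placeholder for an LLM call, and the three-sentence corpus is invented for illustration. The self-learning refinement of generator and retriever is not shown.

```python
import math
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumed stand-in for a dense encoder."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_retrieve(query: str, corpus: List[str],
                  generate: Callable[[str], str], top_k: int = 1) -> List[str]:
    """Rank corpus documents by similarity to a generated hypothetical document."""
    hypothetical = generate(query)      # generator writes a pseudo-document
    h_vec = embed(hypothetical)         # search with the pseudo-document, not the query
    ranked = sorted(corpus, key=lambda d: cosine(h_vec, embed(d)), reverse=True)
    return ranked[:top_k]


def stub_generator(query: str) -> str:
    """Hypothetical placeholder for an LLM; returns a canned passage."""
    return "metformin lowers blood glucose and is a first-line drug for type 2 diabetes"


# Invented example corpus, for illustration only.
corpus = [
    "metformin is a first-line drug that lowers blood glucose in type 2 diabetes",
    "ibuprofen relieves pain and reduces fever",
    "amoxicillin treats bacterial infections",
]

results = hyde_retrieve("what should I take for high blood sugar", corpus, stub_generator)
print(results[0])
```

The query shares only one word with the relevant document, while the hypothetical document shares most of its vocabulary with it; this is the intuition behind searching with generated text. In a realistic setup the toy functions would be replaced by an actual encoder and an actual LLM.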
