Harnessing Large Language Models for Scientific Novelty Detection

Abstract

In an era of exponential scientific growth, identifying novel research ideas is both crucial and challenging. Despite its potential, research on novelty detection is hindered by the lack of a suitable benchmark dataset. More importantly, simply adopting existing natural language processing (NLP) techniques, such as retrieving related work and then cross-checking against it, is not a one-size-fits-all solution, because textual similarity does not necessarily reflect similarity in idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), accompanied by two new datasets in the marketing and NLP domains. To construct suitable datasets for ND, we extract closure sets of papers based on their relationships and then summarize their main ideas with LLMs. To capture idea conception, we train a lightweight retriever by distilling idea-level knowledge from LLMs so that ideas with similar conception are aligned, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show that our method consistently outperforms other methods on the proposed benchmark datasets for both idea retrieval and ND. Code and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.
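
To make the retrieve-then-cross-check pipeline concrete, the following is a minimal sketch rather than our released implementation. It assumes that idea embeddings have already been produced by some retriever, and it stands in for the LLM cross-check with a caller-supplied function. The names `retrieve_top_k`, `detect_novelty`, and `judge`, as well as the toy two-dimensional vectors, are hypothetical and chosen only for illustration.

```python
import math
from typing import Callable, List, Mapping, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_top_k(
    query: Vector, corpus: Mapping[str, Vector], k: int
) -> List[Tuple[str, float]]:
    """Return the k corpus ideas most similar to the query, best first."""
    scored = [(idea_id, cosine(query, vec)) for idea_id, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def detect_novelty(
    query: Vector,
    corpus: Mapping[str, Vector],
    judge: Callable[[List[str]], bool],
    k: int = 3,
) -> bool:
    """Retrieve candidate ideas, then let `judge` decide whether the query is novel.

    `judge` is a hypothetical stand-in for the LLM cross-check: it receives
    the retrieved idea ids and returns True if the query idea is novel.
    """
    candidate_ids = [idea_id for idea_id, _ in retrieve_top_k(query, corpus, k)]
    return judge(candidate_ids)


# Toy embeddings (assumed; a real system would use the trained retriever).
toy_corpus = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
toy_query = [1.0, 0.1]

top_two = retrieve_top_k(toy_query, toy_corpus, k=2)

# Stub judge (assumed): treats the query as not novel if idea "a" is retrieved.
is_novel = detect_novelty(toy_query, toy_corpus, lambda ids: "a" not in ids, k=2)
```

In this sketch the retriever only narrows the candidate set; the novelty decision itself is left to the judge, which mirrors the motivation that similarity scores alone do not capture idea conception.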