Harnessing Large Language Models for Scientific Novelty Detection

Abstract

In an era of exponential scientific growth, identifying novel research ideas is both crucial and challenging. Despite its potential, research on novelty detection has been hindered by the lack of a suitable benchmark dataset. More importantly, simply adopting existing NLP techniques, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution because of the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND), and we introduce two new datasets in the marketing and NLP domains. To construct datasets suited to ND, we extract closure sets of papers based on the relationships among them and then use LLMs to summarize their main ideas. To capture idea conception, we train a lightweight retriever by distilling idea-level knowledge from LLMs so that ideas with similar conception are aligned, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show that our method consistently outperforms others on the proposed benchmark datasets for both idea retrieval and ND. Code and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.
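
The abstract does not spell out how closure sets are extracted. As a minimal sketch, assuming the relationship between papers is citation and that a closure set is everything reachable from a seed paper through citation links, the step might look like the following. The function name closure_set, the max_hops parameter, and the toy citation graph are all hypothetical illustrations, not the authors' implementation.

```python
from collections import deque


def closure_set(seed, cites, max_hops=None):
    """Collect the papers reachable from `seed` by following citation links.

    `cites` maps a paper id to the ids of the papers it cites (an assumed,
    illustrative representation). `max_hops` optionally caps the depth of
    the breadth-first traversal; None means no cap.
    """
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        paper, hops = queue.popleft()
        if max_hops is not None and hops >= max_hops:
            continue  # do not expand beyond the hop limit
        for cited in cites.get(paper, ()):
            if cited not in seen:
                seen.add(cited)
                queue.append((cited, hops + 1))
    return seen


# Toy citation graph (hypothetical paper ids).
cites = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D"],
    "D": [],
    "E": ["A"],
}

full = closure_set("A", cites)                 # follows links to any depth
one_hop = closure_set("A", cites, max_hops=1)  # the seed plus direct citations
```

In this sketch, paper E cites A but is not cited by anything reachable from A, so it stays outside A's closure set. Whether the relation should be followed in both directions is a design choice the abstract leaves open.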