Harnessing Large Language Models for Scientific Novelty Detection

Abstract

In an era of exponential scientific growth, identifying novel research ideas is both crucial and challenging for academia. Despite its potential, research on novelty detection is hindered by the lack of an appropriate benchmark dataset. More importantly, simply adopting existing NLP techniques, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution because of the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND) and introduce two new datasets in the marketing and NLP domains. To construct datasets suited to ND, we propose to extract closure sets of papers based on the relationships among them and then to summarize their main ideas with LLMs. To capture idea conception, we propose to train a lightweight retriever by distilling idea-level knowledge from LLMs so that ideas with similar conception are aligned, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show that our method consistently outperforms other methods on the proposed benchmark datasets for both the idea retrieval and ND tasks. Code and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.
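
To make the retrieve-then-check pattern mentioned above concrete, the following is a minimal sketch and not the method of this paper. It assumes a toy bag-of-words encoder in place of the distilled lightweight retriever and a fixed similarity threshold in place of the LLM's novelty judgment. All names (`embed`, `retrieve`, `is_novel`, `CORPUS`) and the threshold value are hypothetical, chosen only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical mini-corpus of one-sentence idea summaries (illustrative only).
CORPUS = [
    "distill knowledge from a large language model into a small retriever",
    "measure brand loyalty with customer surveys",
    "summarize the main idea of a paper with a language model",
]


def embed(text):
    """Toy stand-in for a learned idea encoder: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query, corpus, k=2):
    """Return the k (score, idea) pairs most similar to the query idea."""
    q = embed(query)
    scored = [(cosine(q, embed(idea)), idea) for idea in corpus]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def is_novel(query, corpus, k=2, threshold=0.5):
    """Placeholder for the cross-check step: an idea counts as novel here
    only if none of its top-k retrieved neighbours reaches the threshold."""
    return all(score < threshold for score, _ in retrieve(query, corpus, k))


if __name__ == "__main__":
    for idea in ["predict protein folding using graph networks", CORPUS[0]]:
        print(idea, "->", "novel" if is_novel(idea, CORPUS) else "not novel")
```

A word-overlap encoder like this one exhibits exactly the gap between textual similarity and idea conception described above: two summaries that express the same idea in different words would score low. That limitation is the motivation for replacing `embed` with a retriever trained on idea-level signals and for replacing the threshold with an LLM judgment.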