Harnessing Large Language Models for Scientific Novelty Detection

Abstract

In an era of exponential scientific growth, identifying novel research ideas is both crucial and challenging in academia. Despite its potential, research on novelty detection is hindered by the lack of an appropriate benchmark dataset. More importantly, simply adopting existing NLP techniques, e.g., retrieving and then cross-checking, is not a one-size-fits-all solution because of the gap between textual similarity and idea conception. In this paper, we propose to harness large language models (LLMs) for scientific novelty detection (ND) and introduce two new datasets in the marketing and NLP domains. To construct carefully designed datasets for ND, we extract closure sets of papers based on their relationships and then summarize their main ideas with LLMs. To capture idea conception, we train a lightweight retriever by distilling idea-level knowledge from LLMs so that ideas with similar conception are aligned, enabling efficient and accurate idea retrieval for LLM-based novelty detection. Experiments show that our method consistently outperforms others on the proposed benchmark datasets for both the idea retrieval and ND tasks. Code and data are available at https://anonymous.4open.science/r/NoveltyDetection-10FB/.
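To make the "retrieving and then cross-checking" baseline mentioned above concrete, the following is a minimal Python sketch of that generic pattern, not the method proposed in this paper. It assumes ideas are short text summaries, uses plain bag-of-words cosine similarity as a stand-in for a textual retriever, and takes a hypothetical `judge` callable in place of an LLM cross-check. The function names `bow_cosine`, `retrieve`, and `is_novel` are illustrative only.

```python
import math
from collections import Counter
from typing import Callable, List


def bow_cosine(a: str, b: str) -> float:
    """Cosine similarity between whitespace-tokenized bag-of-words vectors."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, corpus: List[str], k: int = 2) -> List[str]:
    """Return the k corpus ideas that are textually most similar to the query."""
    ranked = sorted(corpus, key=lambda idea: bow_cosine(query, idea), reverse=True)
    return ranked[:k]


def is_novel(
    query: str,
    corpus: List[str],
    judge: Callable[[str, str], bool],
    k: int = 2,
) -> bool:
    """Flag the query as novel if no retrieved idea is judged to cover it.

    `judge(query, candidate)` is a hypothetical placeholder for the
    cross-checking step (e.g. an LLM call); it returns True when the
    candidate already expresses the query idea.
    """
    return not any(judge(query, candidate) for candidate in retrieve(query, corpus, k))
```

Because `bow_cosine` only counts shared tokens, two summaries that express the same idea in different words score low and may never reach the cross-checking step. This is one way to picture the gap between textual similarity and idea conception that motivates training a retriever on idea-level knowledge.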