

TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora

June 12, 2025
Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
cs.AI

Abstract

The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance across a diverse set of computer science conferences over the years to showcase its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.
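The abstract's core loop, iteratively classifying papers into taxonomy nodes and expanding width or depth where the corpus distribution warrants it, can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Node` structure, the keyword-overlap `classify` stand-in (TaxoAdapt uses LLM-based classification), and the `min_bucket` threshold are all illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    label: str
    children: list = field(default_factory=list)
    papers: list = field(default_factory=list)

def classify(paper, node):
    """Toy stand-in for LLM classification: match on label occurrence."""
    for child in node.children:
        if child.label in paper:
            return child
    return None  # paper fits no existing child

def expand(node, corpus, min_bucket=2):
    """One iteration: route papers, then widen where warranted.

    Returns the crowded children, i.e. candidates for a further
    depth-expansion pass of the same procedure.
    """
    unmatched = []
    for paper in corpus:
        child = classify(paper, node)
        (child.papers if child else unmatched).append(paper)
    # Width expansion: many unmatched papers suggest a missing sibling topic.
    if len(unmatched) >= min_bucket:
        node.children.append(Node("emerging-topic", papers=unmatched))
    # Depth expansion candidates: children holding a large share of the corpus.
    return [c for c in node.children if len(c.papers) >= min_bucket]

root = Node("nlp", children=[Node("parsing"), Node("translation")])
corpus = ["neural parsing methods", "parsing benchmarks",
          "new qa dataset", "qa evaluation"]
dense = expand(root, corpus)
```

Here the two "qa" papers match no existing child, so a new sibling node is added (width), and both it and "parsing" come back as candidates for deeper splitting (depth). In the real system, each paper would also be routed along multiple dimensions (methodology, tasks, metrics, benchmarks) rather than a single tree.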