TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
June 12, 2025
Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
cs.AI
Abstract
The rapid evolution of scientific fields introduces challenges in organizing
and retrieving scientific literature. While expert-curated taxonomies have
traditionally addressed this need, the process is time-consuming and expensive.
Furthermore, recent automatic taxonomy construction methods either (1)
over-rely on a specific corpus, sacrificing generalizability, or (2) depend
heavily on the general knowledge of large language models (LLMs) contained
within their pre-training datasets, often overlooking the dynamic nature of
evolving scientific domains. Additionally, these approaches fail to account for
the multi-faceted nature of scientific literature, where a single research
paper may contribute to multiple dimensions (e.g., methodology, new tasks,
evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a
framework that dynamically adapts an LLM-generated taxonomy to a given corpus
across multiple dimensions. TaxoAdapt performs iterative hierarchical
classification, expanding both the taxonomy's width and depth based on the corpus's
topical distribution. We demonstrate its state-of-the-art performance across a
diverse set of computer science conferences over the years to showcase its
ability to structure and capture the evolution of scientific fields. As a
multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more
granularity-preserving and 50.41% more coherent than the most competitive
baselines, as judged by LLMs.
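The core loop described above — iteratively classifying papers into a hierarchy and expanding the taxonomy's width (new sibling nodes) and depth (new child nodes) wherever the corpus's topical distribution warrants it — can be sketched as follows. This is a minimal illustration, not the paper's method: the node structure, the topic-matching classifier, and the `min_support` threshold are all hypothetical stand-ins for the LLM-based components TaxoAdapt actually uses.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TaxoNode:
    """A taxonomy node holding papers not yet routed to a child."""
    label: str
    papers: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

def classify(topics, node):
    """Hierarchical classification: route a paper down the deepest matching
    path, consuming the topic that matched at each level."""
    for t in topics:
        if t in node.children:
            rest = [x for x in topics if x != t]
            return classify(rest, node.children[t])
    node.papers.append(topics)
    return node

def expand(node, min_support=2):
    """Width expansion: topics shared by >= min_support papers at this node
    become new children. Recursing into children then grows the depth."""
    counts = Counter(t for paper in node.papers for t in paper)
    for topic, n in counts.items():
        if n >= min_support and topic not in node.children:
            node.children[topic] = TaxoNode(topic)
    # Re-route papers that now match a newly created child.
    remaining = []
    for paper in node.papers:
        matched = next((t for t in paper if t in node.children), None)
        if matched is None:
            remaining.append(paper)
        else:
            classify([x for x in paper if x != matched], node.children[matched])
    node.papers = remaining
    for child in node.children.values():
        expand(child, min_support)

# Toy corpus: each paper is a bag of (hypothetical) extracted topic labels.
root = TaxoNode("machine-learning")
corpus = [
    ["retrieval", "benchmarks"],
    ["retrieval", "dense-embeddings"],
    ["retrieval", "dense-embeddings"],
    ["evaluation-metrics"],
]
for paper in corpus:
    classify(paper, root)
expand(root)
# "retrieval" is promoted to a child of the root (width), and
# "dense-embeddings" becomes a child of "retrieval" (depth).
```

In the full framework this classification and expansion would run once per dimension (methodology, tasks, evaluation metrics, benchmarks), yielding one adapted taxonomy per facet rather than the single tree shown here.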