TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
June 12, 2025
Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
cs.AI
Abstract
The rapid evolution of scientific fields introduces challenges in organizing and retrieving scientific literature. While expert-curated taxonomies have traditionally addressed this need, the process is time-consuming and expensive. Furthermore, recent automatic taxonomy construction methods either (1) over-rely on a specific corpus, sacrificing generalizability, or (2) depend heavily on the general knowledge of large language models (LLMs) contained within their pre-training datasets, often overlooking the dynamic nature of evolving scientific domains. Additionally, these approaches fail to account for the multi-faceted nature of scientific literature, where a single research paper may contribute to multiple dimensions (e.g., methodology, new tasks, evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a framework that dynamically adapts an LLM-generated taxonomy to a given corpus across multiple dimensions. TaxoAdapt performs iterative hierarchical classification, expanding both the taxonomy's width and depth based on the corpus's topical distribution. We demonstrate its state-of-the-art performance on a diverse set of computer science conferences across multiple years, showcasing its ability to structure and capture the evolution of scientific fields. As a multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more granularity-preserving and 50.41% more coherent than the most competitive baselines, as judged by LLMs.
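
To make the iterative width/depth expansion described in the abstract concrete, below is a minimal Python sketch of one possible realization. It is an illustration under stated assumptions, not the authors' implementation: `classify` and `propose_children` are hypothetical stand-ins for LLM prompts, and the thresholds `MIN_PAPERS_TO_SPLIT` and `MAX_UNCOVERED_FRAC` are invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One taxonomy node: a topic label, the papers routed to it, and its children."""
    label: str
    papers: list[str] = field(default_factory=list)
    children: list["Node"] = field(default_factory=list)


# Illustrative thresholds (assumptions, not values from the paper).
MIN_PAPERS_TO_SPLIT = 5   # a node needs this many papers before we refine it
MAX_UNCOVERED_FRAC = 0.2  # widen when over 20% of papers fit no existing child


def classify(paper: str, labels: list[str]) -> str | None:
    """Stand-in for an LLM hierarchical-classification call: pick the child
    label that best matches `paper`, or None if none fits. This toy version
    just checks for a substring match."""
    return next((l for l in labels if l.lower() in paper.lower()), None)


def propose_children(parent_label: str, uncovered: list[str]) -> list[str]:
    """Stand-in for an LLM call that names new subtopics covering the papers
    no existing child absorbed (width expansion). Returns a dummy label here."""
    return [f"{parent_label}: emerging subtopic"] if uncovered else []


def adapt(node: Node) -> None:
    """Iteratively expand the taxonomy under `node` in width and depth,
    driven by how the corpus distributes over its children."""
    if len(node.papers) < MIN_PAPERS_TO_SPLIT:
        return  # too sparse to refine further: keep this node a leaf
    labels = [c.label for c in node.children]
    uncovered: list[str] = []
    for paper in node.papers:
        match = classify(paper, labels)
        if match is None:
            uncovered.append(paper)
        else:
            next(c for c in node.children if c.label == match).papers.append(paper)
    # Width expansion: add sibling topics when too many papers fall
    # outside the current children.
    if len(uncovered) / len(node.papers) > MAX_UNCOVERED_FRAC:
        for label in propose_children(node.label, uncovered):
            node.children.append(Node(label, papers=list(uncovered)))
    # Depth expansion: recurse into children that accumulated enough papers.
    for child in node.children:
        adapt(child)
```

In the multidimensional setting the abstract describes, one such tree would be maintained per dimension (e.g., methodology, tasks, evaluation metrics, benchmarks), so a single paper can be routed into several taxonomies at once.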