TaxoAdapt: Aligning LLM-Based Multidimensional Taxonomy Construction to Evolving Research Corpora
June 12, 2025
Authors: Priyanka Kargupta, Nan Zhang, Yunyi Zhang, Rui Zhang, Prasenjit Mitra, Jiawei Han
cs.AI
Abstract
The rapid evolution of scientific fields introduces challenges in organizing
and retrieving scientific literature. While expert-curated taxonomies have
traditionally addressed this need, the process is time-consuming and expensive.
Furthermore, recent automatic taxonomy construction methods either (1)
over-rely on a specific corpus, sacrificing generalizability, or (2) depend
heavily on the general knowledge of large language models (LLMs) contained
within their pre-training datasets, often overlooking the dynamic nature of
evolving scientific domains. Additionally, these approaches fail to account for
the multi-faceted nature of scientific literature, where a single research
paper may contribute to multiple dimensions (e.g., methodology, new tasks,
evaluation metrics, benchmarks). To address these gaps, we propose TaxoAdapt, a
framework that dynamically adapts an LLM-generated taxonomy to a given corpus
across multiple dimensions. TaxoAdapt performs iterative hierarchical
classification, expanding both the taxonomy's width and depth based on the corpus's
topical distribution. We demonstrate its state-of-the-art performance across a
diverse set of computer science conferences over the years to showcase its
ability to structure and capture the evolution of scientific fields. As a
multidimensional method, TaxoAdapt generates taxonomies that are 26.51% more
granularity-preserving and 50.41% more coherent than the most competitive
baselines, as judged by LLMs.
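The core loop described above — iteratively classifying papers into a hierarchy and expanding the taxonomy's width (new sibling nodes) and depth (new child nodes) wherever the corpus's topical distribution warrants it — can be sketched as follows. This is a minimal illustration, not the paper's method: the node structure, the topic-matching classifier, and the `min_support` threshold are all hypothetical stand-ins for the LLM-based components TaxoAdapt actually uses.

```python
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class TaxoNode:
    """A taxonomy node holding papers not yet routed to a child."""
    label: str
    papers: list = field(default_factory=list)
    children: dict = field(default_factory=dict)

def classify(topics, node):
    """Hierarchical classification: route a paper down the deepest matching
    path, consuming the topic that matched at each level."""
    for t in topics:
        if t in node.children:
            rest = [x for x in topics if x != t]
            return classify(rest, node.children[t])
    node.papers.append(topics)
    return node

def expand(node, min_support=2):
    """Width expansion: topics shared by >= min_support papers at this node
    become new children. Recursing into children then grows the depth."""
    counts = Counter(t for paper in node.papers for t in paper)
    for topic, n in counts.items():
        if n >= min_support and topic not in node.children:
            node.children[topic] = TaxoNode(topic)
    # Re-route papers that now match a newly created child.
    remaining = []
    for paper in node.papers:
        matched = next((t for t in paper if t in node.children), None)
        if matched is None:
            remaining.append(paper)
        else:
            classify([x for x in paper if x != matched], node.children[matched])
    node.papers = remaining
    for child in node.children.values():
        expand(child, min_support)

# Toy corpus: each paper is a bag of (hypothetical) extracted topic labels.
root = TaxoNode("machine-learning")
corpus = [
    ["retrieval", "benchmarks"],
    ["retrieval", "dense-embeddings"],
    ["retrieval", "dense-embeddings"],
    ["evaluation-metrics"],
]
for paper in corpus:
    classify(paper, root)
expand(root)
# "retrieval" is promoted to a child of the root (width), and
# "dense-embeddings" becomes a child of "retrieval" (depth).
```

In the full framework this classification and expansion would run once per dimension (methodology, tasks, evaluation metrics, benchmarks), yielding one adapted taxonomy per facet rather than the single tree shown here.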