TnT-LLM: Text Mining at Scale with Large Language Models

Abstract

Transforming unstructured text into structured and meaningful forms, organized by useful category labels, is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, making the process expensive and time-consuming. This is particularly challenging when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate the process of end-to-end label generation and assignment with minimal human effort for any given use-case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach which enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies when compared against state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
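The two phases described above can be illustrated with a minimal sketch. It is not the authors' implementation. Every name in it (`generate_taxonomy`, `pseudo_label`, `WordCountClassifier`, `toy_llm`) is hypothetical, the prompts are invented for illustration, and `toy_llm` is a keyword-rule stand-in for a real LLM so that the example runs offline. The word-count classifier is likewise only a placeholder for whichever lightweight supervised model is used in practice.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a prompt-based LLM call: prompt in, text out.
LLM = Callable[[str], str]


def generate_taxonomy(texts: List[str], llm: LLM, batch_size: int = 2) -> List[str]:
    """Phase 1 (sketch): build a label taxonomy and refine it batch by batch."""
    taxonomy: List[str] = []
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        prompt = (
            "Current labels: " + ", ".join(taxonomy) + "\n"
            "Update the label list so it covers these texts:\n" + "\n".join(batch)
        )
        # Assumption: the LLM replies with a comma-separated list of labels.
        reply = llm(prompt)
        taxonomy = sorted({part.strip() for part in reply.split(",") if part.strip()})
    return taxonomy


def pseudo_label(texts: List[str], taxonomy: List[str], llm: LLM) -> List[Tuple[str, str]]:
    """Phase 2a (sketch): use the LLM as an annotator to produce training pairs."""
    pairs: List[Tuple[str, str]] = []
    for text in texts:
        prompt = f"Labels: {', '.join(taxonomy)}\nAssign one label to: {text}"
        reply = llm(prompt).strip()
        if reply in taxonomy:  # drop replies that fall outside the taxonomy
            pairs.append((text, reply))
    return pairs


class WordCountClassifier:
    """Phase 2b (sketch): a tiny word-count model trained on pseudo labels."""

    def __init__(self) -> None:
        self.word_counts: Dict[str, Counter] = defaultdict(Counter)

    def fit(self, pairs: List[Tuple[str, str]]) -> "WordCountClassifier":
        for text, label in pairs:
            self.word_counts[label].update(text.lower().split())
        return self

    def predict(self, text: str) -> str:
        words = text.lower().split()

        def score(label: str) -> float:
            # Share of the label's training words that match the input words.
            counts = self.word_counts[label]
            total = sum(counts.values()) or 1
            return sum(counts[word] for word in words) / total

        return max(sorted(self.word_counts), key=score)


def toy_llm(prompt: str) -> str:
    """Keyword rules that imitate an LLM so the sketch runs without a model."""
    rules = {"code": "programming", "python": "programming",
             "recipe": "cooking", "bake": "cooking"}
    header, body = prompt.split("\n", 1)
    found = {label for word, label in rules.items() if word in body.lower()}
    if header.startswith("Current labels:"):
        # Taxonomy prompt: keep the existing labels and add newly seen ones.
        current = header[len("Current labels:"):]
        kept = {part.strip() for part in current.split(",") if part.strip()}
        return ", ".join(sorted(kept | found))
    # Labeling prompt: return a single label, or nothing if no rule fires.
    return sorted(found)[0] if found else ""


if __name__ == "__main__":
    corpus = [
        "How do I write Python code for sorting?",
        "Share a recipe for soup",
        "Fix my Python script",
        "How long to bake bread?",
    ]
    labels = generate_taxonomy(corpus, toy_llm)
    model = WordCountClassifier().fit(pseudo_label(corpus, labels, toy_llm))
    print(labels, model.predict("debug python code"))
```

The point of the split is that the LLM is called only while the taxonomy is built and the training pairs are labeled. Once fitted, the small classifier serves predictions without any further LLM calls, which is where the efficiency at scale would come from.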
