TnT-LLM: Text Mining at Scale with Large Language Models
March 18, 2024
Authors: Mengting Wan, Tara Safavi, Sujay Kumar Jauhar, Yujin Kim, Scott Counts, Jennifer Neville, Siddharth Suri, Chirag Shah, Ryen W White, Longqi Yang, Reid Andersen, Georg Buscher, Dhruv Joshi, Nagu Rangan
cs.AI
Abstract
Transforming unstructured text into structured and meaningful forms,
organized by useful category labels, is a fundamental step in text mining for
downstream analysis and application. However, most existing methods for
producing label taxonomies and building text-based label classifiers still rely
heavily on domain expertise and manual curation, making the process expensive
and time-consuming. This is particularly challenging when the label space is
under-specified and large-scale data annotations are unavailable. In this
paper, we address these challenges with Large Language Models (LLMs), whose
prompt-based interface facilitates the induction and use of large-scale pseudo
labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate
the process of end-to-end label generation and assignment with minimal human
effort for any given use-case. In the first phase, we introduce a zero-shot,
multi-stage reasoning approach which enables LLMs to produce and refine a label
taxonomy iteratively. In the second phase, LLMs are used as data labelers that
yield training samples so that lightweight supervised classifiers can be
reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis
of user intent and conversational domain for Bing Copilot (formerly Bing Chat),
an open-domain chat-based search engine. Extensive experiments using both human
and automatic evaluation metrics demonstrate that TnT-LLM generates more
accurate and relevant label taxonomies when compared against state-of-the-art
baselines, and achieves a favorable balance between accuracy and efficiency for
classification at scale. We also share our practical experiences and insights
on the challenges and opportunities of using LLMs for large-scale text mining
in real-world applications.