TnT-LLM: Text Mining at Scale with Large Language Models

Abstract

Transforming unstructured text into structured, meaningful forms organized by useful category labels is a fundamental step in text mining for downstream analysis and application. However, most existing methods for producing label taxonomies and building text-based label classifiers still rely heavily on domain expertise and manual curation, which makes the process expensive and time-consuming. The problem is especially hard when the label space is under-specified and large-scale data annotations are unavailable. In this paper, we address these challenges with Large Language Models (LLMs), whose prompt-based interface facilitates the induction and use of large-scale pseudo labels. We propose TnT-LLM, a two-phase framework that employs LLMs to automate end-to-end label generation and assignment with minimal human effort for any given use case. In the first phase, we introduce a zero-shot, multi-stage reasoning approach that enables LLMs to produce and refine a label taxonomy iteratively. In the second phase, LLMs are used as data labelers that yield training samples, so that lightweight supervised classifiers can be reliably built, deployed, and served at scale. We apply TnT-LLM to the analysis of user intent and conversational domain for Bing Copilot (formerly Bing Chat), an open-domain chat-based search engine. Extensive experiments using both human and automatic evaluation metrics demonstrate that TnT-LLM generates more accurate and relevant label taxonomies than state-of-the-art baselines, and achieves a favorable balance between accuracy and efficiency for classification at scale. We also share our practical experiences and insights on the challenges and opportunities of using LLMs for large-scale text mining in real-world applications.
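The second phase described above (LLM-produced pseudo labels used to train a lightweight classifier) can be illustrated with a minimal sketch. This is not the authors' implementation. The taxonomy, the `llm_label` stub (a keyword rule standing in for a real LLM call), the toy corpus, and the small naive Bayes classifier are all assumptions chosen only to keep the example self-contained.

```python
from collections import Counter, defaultdict
import math

# Hypothetical taxonomy; in TnT-LLM this would be the output of the first phase.
TAXONOMY = ["coding", "travel"]

def llm_label(text):
    """Stand-in for an LLM labeler (assumption: a real system would prompt an LLM)."""
    keywords = ("python", "bug", "code")
    return "coding" if any(w in text.lower() for w in keywords) else "travel"

def tokenize(text):
    return text.lower().split()

def train_naive_bayes(texts, labels):
    """Fit a multinomial naive Bayes model on pseudo-labeled texts."""
    word_counts = defaultdict(Counter)
    class_counts = Counter(labels)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in class_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy corpus (made up for illustration).
corpus = [
    "how do I fix this python bug",
    "write code to sort a list",
    "best beaches to visit in spain",
    "cheap flights to tokyo in spring",
]

# Step 1: the (stubbed) LLM assigns pseudo labels from the taxonomy.
pseudo_labels = [llm_label(t) for t in corpus]

# Step 2: a lightweight classifier is trained on those pseudo labels
# and can then serve predictions without calling the LLM.
model = train_naive_bayes(corpus, pseudo_labels)
print(predict(model, "sort a list in python"))
```

The design point the sketch tries to capture is the cost split: the expensive labeler runs once over a training sample, while the cheap classifier handles inference at scale.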
