Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions
June 7, 2023
Authors: John Joon Young Chung, Ece Kamar, Saleema Amershi
cs.AI
Abstract
Large language models (LLMs) can be used to generate text data for training
and evaluating other models. However, creating high-quality datasets with LLMs
can be challenging. In this work, we explore human-AI partnerships to
facilitate high diversity and accuracy in LLM-based text data generation. We
first examine two approaches to diversify text generation: 1) logit
suppression, which reduces the generation of language that has already been
frequently generated, and 2) temperature sampling, which flattens the token
sampling probability. We found that diversification approaches can increase
data diversity but often at the cost of data accuracy (i.e., text and labels
being appropriate for the target domain). To address this issue, we examined
two human interventions, 1) label replacement (LR), correcting misaligned
labels, and 2) out-of-scope filtering (OOSF), removing instances that are out
of the user's domain of interest or to which no considered label applies. With
oracle studies, we found that LR increases the absolute accuracy of models
trained with diversified datasets by 14.4%. Moreover, we found that some models
trained with data generated with LR interventions outperformed LLM-based
few-shot classification. In contrast, OOSF was not effective in increasing
model accuracy, implying the need for future work in human-in-the-loop text
data generation.
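
As a rough illustration of the two diversification approaches named in the abstract, the sketch below shows how temperature scaling flattens a token distribution and how a frequency-based penalty can push down the logits of tokens that have already been generated often. The abstract does not give the paper's exact formulation, so the linear penalty, the function names, and the toy vocabulary here are all assumptions made for illustration only.

```python
import math
from collections import Counter


def softmax_with_temperature(logits, temperature=1.0):
    """Turn logits into probabilities; a higher temperature flattens them."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def suppress_frequent_tokens(logits, vocab, generated_counts, penalty=1.0):
    """Lower each token's logit in proportion to how often it was generated.

    The linear penalty is a hypothetical choice, not the paper's formula.
    """
    return [
        logit - penalty * generated_counts.get(token, 0)
        for logit, token in zip(logits, vocab)
    ]


# Toy vocabulary and logits (illustrative values, not real model output).
vocab = ["great", "good", "fine", "superb"]
logits = [3.0, 2.0, 1.0, 0.5]

# Pretend earlier generations already used "great" three times and "good" once.
counts = Counter(["great", "great", "great", "good"])

sharp = softmax_with_temperature(logits, temperature=1.0)
flat = softmax_with_temperature(logits, temperature=2.0)
suppressed = softmax_with_temperature(
    suppress_frequent_tokens(logits, vocab, counts)
)

for name, probs in [("T=1.0", sharp), ("T=2.0", flat), ("suppressed", suppressed)]:
    print(name, [round(p, 3) for p in probs])
```

In this toy setup, raising the temperature lowers the probability of the most likely token, and the suppression step shifts probability mass away from "great" toward tokens that have appeared less often. Both effects make rarer outputs more likely, which is consistent with the trade-off the abstract describes: more diversity, but a higher chance of text that drifts from the intended label or domain.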