Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models
May 19, 2023
Authors: Emily Reif, Minsuk Kahng, Savvas Petridis
cs.AI
Abstract
Large language models (LLMs) can be used to generate smaller, more refined
datasets via few-shot prompting for benchmarking, fine-tuning or other use
cases. However, understanding and evaluating these datasets is difficult, and
the failure modes of LLM-generated data are still not well understood.
Specifically, the data can be repetitive in surprising ways, not only
semantically but also syntactically and lexically. We present LinguisticLens, a
novel interactive visualization tool for making sense of and analyzing
syntactic diversity of LLM-generated datasets. LinguisticLens clusters text
along syntactic, lexical, and semantic axes. It supports hierarchical
visualization of a text dataset, allowing users to quickly scan for an overview
and inspect individual examples. The live demo is available at
shorturl.at/zHOUV.
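
To make the idea of clustering text along a lexical axis concrete, here is a minimal sketch. It is not the LinguisticLens implementation: the tokenizer, the Jaccard similarity measure, the greedy grouping strategy, the 0.5 threshold, and the function names (tokens, jaccard, greedy_clusters) are all assumptions chosen for illustration.

```python
import re


def tokens(text):
    """Lowercase word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(a, b):
    """Jaccard similarity of two token lists, compared as sets."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def greedy_clusters(texts, threshold=0.5):
    """Put each text in the first cluster whose seed text it resembles.

    `threshold` is an arbitrary illustrative cutoff, not a value from the paper.
    """
    clusters = []  # each cluster is a list of indices; the first index is the seed
    for i, text in enumerate(texts):
        for cluster in clusters:
            seed = texts[cluster[0]]
            if jaccard(tokens(text), tokens(seed)) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


examples = [
    "The movie was great and I loved it",
    "The movie was awful and I hated it",
    "Where can I buy fresh bread nearby",
]
print(greedy_clusters(examples))  # [[0, 1], [2]]
```

The first two examples share six of ten distinct words, so they land in one cluster even though their sentiment is opposite. This is the kind of lexical repetition that a purely semantic view can miss. Replacing the word tokens with part-of-speech tags from a tagger would give a rough analogue for the syntactic axis, again as an assumption rather than the authors' method.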