Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

Abstract

Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.
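To make the idea of clustering along a lexical axis concrete, the following is a minimal sketch and not the method used by LinguisticLens. It assumes a crude regex tokenizer, Jaccard overlap of word sets as the similarity measure, and a greedy single-pass grouping; the function names and the 0.5 threshold are hypothetical choices made for illustration only.

```python
import re


def tokens(text):
    """Lowercased word tokens; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def jaccard(a, b):
    """Jaccard similarity between two token collections, in [0, 1]."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def lexical_clusters(texts, threshold=0.5):
    """Greedy single-pass clustering (an illustrative assumption).

    Each text joins the first cluster whose first member is at least
    `threshold` similar to it; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of indices into `texts`.
    """
    clusters = []
    for i, text in enumerate(texts):
        toks = tokens(text)
        for cluster in clusters:
            representative = tokens(texts[cluster[0]])
            if jaccard(toks, representative) >= threshold:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([i])
    return clusters


# Hypothetical LLM-generated examples that repeat one lexical template.
examples = [
    "What is the capital of France?",
    "What is the capital of Spain?",
    "Describe how photosynthesis works.",
    "What is the capital of Italy?",
]
clusters = lexical_clusters(examples)
```

In this toy input, the three "What is the capital of ..." questions share five of seven distinct words pairwise, so they fall into one cluster while the remaining sentence stands alone. A syntactic or semantic axis would need a different representation (for example part-of-speech sequences or sentence embeddings), which this sketch does not attempt.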
