Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

Abstract

Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. A live demo is available at shorturl.at/zHOUV.

