Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

May 19, 2023
Authors: Emily Reif, Minsuk Kahng, Savvas Petridis
cs.AI

Abstract

Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.