Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

Abstract

Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.
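
To make the notion of lexical repetition concrete, the following is a minimal sketch of grouping generated texts by word n-gram overlap. It is an illustration only and is not taken from LinguisticLens: the function names (`word_ngrams`, `jaccard`, `cluster_by_lexical_overlap`), the whitespace tokenization, the Jaccard measure, and the 0.5 threshold are all assumptions chosen for brevity.

```python
from itertools import combinations


def word_ngrams(text, n=2):
    """Return the set of lowercase word n-grams in a text (whitespace tokens)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_by_lexical_overlap(texts, threshold=0.5, n=2):
    """Group texts whose n-gram Jaccard similarity meets the threshold.

    Hypothetical helper: single-link grouping via union-find, so texts
    connected through any chain of similar pairs share one cluster.
    """
    grams = [word_ngrams(t, n) for t in texts]
    parent = list(range(len(texts)))

    def find(i):
        # Follow parent links to the root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Merge every pair of texts that is lexically similar enough.
    for i, j in combinations(range(len(texts)), 2):
        if jaccard(grams[i], grams[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect texts under their cluster root, preserving input order.
    clusters = {}
    for i, text in enumerate(texts):
        clusters.setdefault(find(i), []).append(text)
    return list(clusters.values())
```

Calling `cluster_by_lexical_overlap` on a list of generated examples returns a list of groups; large groups point to examples that reuse the same wording. A syntactic or semantic axis would need a different representation (for instance part-of-speech sequences or sentence embeddings), which this sketch does not cover.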
