Visualizing Linguistic Diversity of Text Datasets Synthesized by Large Language Models

Abstract

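Large language models (LLMs) can be used to generate smaller, more refined datasets via few-shot prompting for benchmarking, fine-tuning, or other use cases. However, understanding and evaluating these datasets is difficult, and the failure modes of LLM-generated data are still not well understood. Specifically, the data can be repetitive in surprising ways, not only semantically but also syntactically and lexically. We present LinguisticLens, a novel interactive visualization tool for making sense of and analyzing the syntactic diversity of LLM-generated datasets. LinguisticLens clusters text along syntactic, lexical, and semantic axes. It supports hierarchical visualization of a text dataset, allowing users to quickly scan for an overview and inspect individual examples. The live demo is available at shorturl.at/zHOUV.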
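To make the distinction between lexical and syntactic repetition concrete, the following is a minimal sketch and not the method used by LinguisticLens, which is not described in this abstract. It assumes a tiny hand-picked function-word list and uses a masked "skeleton" of each sentence as a crude stand-in for syntactic structure; a real analysis would more likely rely on part-of-speech tags or parse trees. All names here (`FUNCTION_WORDS`, `skeleton`, `group_by_skeleton`) are hypothetical.

```python
# Illustrative only: a toy proxy for lexical vs. syntactic repetition.
# FUNCTION_WORDS is a small hand-picked list (an assumption, not a standard resource).
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "was", "what", "who", "how",
                  "of", "in", "on", "to", "do", "does", "i", "you", "can"}

PUNCT = ".,!?;:\"'"


def tokens(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    stripped = (t.strip(PUNCT) for t in text.lower().split())
    return [t for t in stripped if t]


def skeleton(text):
    """Crude syntactic proxy: keep function words, mask all other tokens as '_'."""
    return tuple(t if t in FUNCTION_WORDS else "_" for t in tokens(text))


def jaccard(a, b):
    """Lexical overlap between two token lists, measured on their vocabularies."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def group_by_skeleton(texts):
    """Group texts that share the same skeleton (same masked template)."""
    groups = {}
    for text in texts:
        groups.setdefault(skeleton(text), []).append(text)
    return groups


examples = [
    "What is the capital of France?",
    "What is the color of grass?",
    "How do birds fly?",
]

for template, members in group_by_skeleton(examples).items():
    print(" ".join(template), "->", members)
```

In this toy input, the first two questions differ in content words yet share one skeleton, which is the kind of templated repetition that is easy to miss when reading examples one at a time.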
