LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset

Abstract

Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important because of their widespread use across applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. The dataset was collected in the wild from 210K unique IP addresses through our Vicuna demo and Chatbot Arena website. We give an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
