LMSYS-Chat-1M: A Large-Scale Real-World LLM Conversation Dataset
September 21, 2023
Authors: Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Tianle Li, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zhuohan Li, Zi Lin, Eric P. Xing, Joseph E. Gonzalez, Ion Stoica, Hao Zhang
cs.AI
Abstract
Studying how people interact with large language models (LLMs) in real-world scenarios is increasingly important due to their widespread use in various applications. In this paper, we introduce LMSYS-Chat-1M, a large-scale dataset containing one million real-world conversations with 25 state-of-the-art LLMs. This dataset is collected from 210K unique IP addresses in the wild on our Vicuna demo and Chatbot Arena website. We offer an overview of the dataset's content, including its curation process, basic statistics, and topic distribution, highlighting its diversity, originality, and scale. We demonstrate its versatility through four use cases: developing content moderation models that perform similarly to GPT-4, building a safety benchmark, training instruction-following models that perform similarly to Vicuna, and creating challenging benchmark questions. We believe that this dataset will serve as a valuable resource for understanding and advancing LLM capabilities. The dataset is publicly available at https://huggingface.co/datasets/lmsys/lmsys-chat-1m.
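
A minimal sketch of loading the dataset with the Hugging Face `datasets` library is shown below. The field names used here ("model", "conversation", "role", "content") are assumptions based on common conversation-dataset layouts; consult the dataset card at the URL above for the authoritative schema, and note that access may require accepting the dataset's terms on the Hub and logging in first.

    # Minimal sketch: load LMSYS-Chat-1M and inspect one record.
    # Field names ("model", "conversation", "role", "content") are assumed;
    # see the dataset card for the authoritative schema.
    from datasets import load_dataset

    # The dataset may be gated; accepting its terms on the Hugging Face Hub
    # and logging in (e.g. via `huggingface-cli login`) may be required.
    dataset = load_dataset("lmsys/lmsys-chat-1m", split="train")

    example = dataset[0]
    print(example["model"])  # which of the 25 LLMs served this conversation (assumed field)
    for turn in example["conversation"]:  # multi-turn chat as a list of role/content dicts (assumed)
        print(turn["role"], ":", turn["content"][:80])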