WildChat: 1M ChatGPT Interaction Logs in the Wild
May 2, 2024
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
cs.AI
Abstract
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite
their widespread use, there remains a lack of public datasets showcasing how
these tools are used by a population of users in practice. To bridge this gap,
we offered free access to ChatGPT for online users in exchange for their
affirmative, consensual opt-in to anonymously collect their chat transcripts
and request headers. From this, we compiled WildChat, a corpus of 1 million
user-ChatGPT conversations, which consists of over 2.5 million interaction
turns. We compare WildChat with other popular user-chatbot interaction
datasets, and find that our dataset offers the most diverse user prompts,
contains the largest number of languages, and presents the richest variety of
potentially toxic use-cases for researchers to study. In addition to
timestamped chat transcripts, we enrich the dataset with demographic data,
including state, country, and hashed IP addresses, alongside request headers.
This augmentation allows for more detailed analysis of user behaviors across
different geographical regions and temporal dimensions. Finally, because it
captures a broad range of use cases, we demonstrate the dataset's potential
utility in fine-tuning instruction-following models. WildChat is released at
https://wildchat.allen.ai under AI2 ImpACT Licenses.
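As a rough illustration of the geographic and temporal analysis the abstract mentions, the sketch below counts conversations per country and sums interaction turns per month. The records are made-up toy data, and the field names (country, state, timestamp, turns) and both helper functions are assumptions chosen for the example; they are not taken from the released dataset's schema.

```python
from collections import Counter, defaultdict
from datetime import datetime

# Toy records for illustration only. Field names are hypothetical and
# do not reflect the actual WildChat schema.
conversations = [
    {"country": "United States", "state": "California",
     "timestamp": "2023-05-01T10:15:00", "turns": 3},
    {"country": "United States", "state": "New York",
     "timestamp": "2023-05-01T22:40:00", "turns": 1},
    {"country": "Russia", "state": "Moscow",
     "timestamp": "2023-06-12T08:05:00", "turns": 4},
]


def conversations_by_country(records):
    """Count how many conversations originate from each country."""
    return Counter(r["country"] for r in records)


def turns_by_month(records):
    """Sum interaction turns per calendar month (YYYY-MM)."""
    totals = defaultdict(int)
    for r in records:
        month = datetime.fromisoformat(r["timestamp"]).strftime("%Y-%m")
        totals[month] += r["turns"]
    return dict(totals)


if __name__ == "__main__":
    print(conversations_by_country(conversations))
    print(turns_by_month(conversations))
```

The same grouping pattern would apply to any per-conversation metadata, such as state or hashed IP address, provided the real field names are substituted.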