WildChat: 1M ChatGPT Interaction Logs in the Wild

May 2, 2024
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
cs.AI

Abstract

Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite their widespread use, there remains a lack of public datasets showcasing how these tools are used by a population of users in practice. To bridge this gap, we offered free access to ChatGPT for online users in exchange for their affirmative, consensual opt-in to anonymously collect their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations, which consists of over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets, and find that our dataset offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use-cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across different geographical regions and temporal dimensions. Finally, because it captures a broad range of use cases, we demonstrate the dataset's potential utility in fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.
