WildChat: 1M ChatGPT Interaction Logs in the Wild

Abstract

Chatbots such as GPT-4 and ChatGPT now serve millions of users. Despite their widespread use, there remains a lack of public datasets that show how a real population of users employs these tools in practice. To bridge this gap, we offered online users free access to ChatGPT in exchange for their affirmative, consensual opt-in to the anonymous collection of their chat transcripts and request headers. From this, we compiled WildChat, a corpus of 1 million user-ChatGPT conversations comprising over 2.5 million interaction turns. We compare WildChat with other popular user-chatbot interaction datasets and find that it offers the most diverse user prompts, contains the largest number of languages, and presents the richest variety of potentially toxic use cases for researchers to study. In addition to timestamped chat transcripts, we enrich the dataset with demographic data, including state, country, and hashed IP addresses, alongside request headers. This augmentation allows for more detailed analysis of user behaviors across geographical regions and over time. Finally, because the dataset captures a broad range of use cases, we demonstrate its potential utility for fine-tuning instruction-following models. WildChat is released at https://wildchat.allen.ai under AI2 ImpACT Licenses.
