WildChat: 1M ChatGPT Interaction Logs in the Wild
May 2, 2024
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
cs.AI
Abstract
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite
their widespread use, there remains a lack of public datasets showcasing how
these tools are used by a population of users in practice. To bridge this gap,
we offered free access to ChatGPT for online users in exchange for their
affirmative, consensual opt-in to anonymously collect their chat transcripts
and request headers. From this, we compiled WildChat, a corpus of 1 million
user-ChatGPT conversations, which consists of over 2.5 million interaction
turns. We compare WildChat with other popular user-chatbot interaction
datasets, and find that our dataset offers the most diverse user prompts,
contains the largest number of languages, and presents the richest variety of
potentially toxic use-cases for researchers to study. In addition to
timestamped chat transcripts, we enrich the dataset with demographic data,
including state, country, and hashed IP addresses, alongside request headers.
This augmentation allows for more detailed analysis of user behaviors across
different geographical regions and temporal dimensions. Finally, because it
captures a broad range of use cases, we demonstrate the dataset's potential
utility in fine-tuning instruction-following models. WildChat is released at
https://wildchat.allen.ai under AI2 ImpACT Licenses.