WildChat: 1M ChatGPT Interaction Logs in the Wild
May 2, 2024
Authors: Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, Yuntian Deng
cs.AI
Abstract
Chatbots such as GPT-4 and ChatGPT are now serving millions of users. Despite
their widespread use, there remains a lack of public datasets showcasing how
these tools are used by a population of users in practice. To bridge this gap,
we offered free access to ChatGPT for online users in exchange for their
affirmative, consensual opt-in to anonymously collect their chat transcripts
and request headers. From this, we compiled WildChat, a corpus of 1 million
user-ChatGPT conversations, which consists of over 2.5 million interaction
turns. We compare WildChat with other popular user-chatbot interaction
datasets, and find that our dataset offers the most diverse user prompts,
contains the largest number of languages, and presents the richest variety of
potentially toxic use-cases for researchers to study. In addition to
timestamped chat transcripts, we enrich the dataset with demographic data,
including state, country, and hashed IP addresses, alongside request headers.
This augmentation allows for more detailed analysis of user behaviors across
different geographical regions and temporal dimensions. Finally, because it
captures a broad range of use cases, we demonstrate the dataset's potential
utility in fine-tuning instruction-following models. WildChat is released at
https://wildchat.allen.ai under AI2 ImpACT Licenses.