DialogStudio: Towards Richest and Most Diverse Unified Dataset Collection for Conversational AI

July 19, 2023
Authors: Jianguo Zhang, Kun Qian, Zhiwei Liu, Shelby Heinecke, Rui Meng, Ye Liu, Zhou Yu, Huan Wang, Silvio Savarese, Caiming Xiong
cs.AI

Abstract

Despite advances in conversational AI, language models still struggle to handle diverse conversational tasks, and existing dialogue dataset collections often lack diversity and comprehensiveness. To tackle these issues, we introduce DialogStudio: the largest and most diverse collection of dialogue datasets, unified under a consistent format while preserving their original information. Our collection encompasses data from open-domain dialogues, task-oriented dialogues, natural language understanding, conversational recommendation, dialogue summarization, and knowledge-grounded dialogues, making it a rich and diverse resource for dialogue research and model training. To further enhance the utility of DialogStudio, we identify the license for each dataset and design domain-aware prompts for selected dialogues to facilitate instruction-aware fine-tuning. Furthermore, we develop conversational AI models using the dataset collection, and our experiments in both zero-shot and few-shot learning scenarios demonstrate the superiority of DialogStudio. To improve transparency and support dataset and task-based research, as well as language model pre-training, all datasets, licenses, code, and models associated with DialogStudio are publicly available at https://github.com/salesforce/DialogStudio.