DiaSynth -- Synthetic Dialogue Generation Framework

September 25, 2024
Authors: Sathya Krishnan Suresh, Wu Mengjun, Tushar Pranav, Eng Siong Chng
cs.AI

Abstract

The scarcity of domain-specific dialogue datasets, in areas ranging from academic topics to everyday conversations, limits the development of dialogue systems for various applications. Existing research is often constrained by dialogue datasets that are either too general or confined to a niche domain at a scale too small for training dialogue systems. To address this gap, we introduce DiaSynth, a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Our approach differs from existing frameworks by dynamically generating dialogues that incorporate simulated personas, subtopics, and diverse conversational characteristics, using a Large Language Model (LLM) with Chain of Thought (CoT) reasoning to create tailored, domain-specific dialogues that closely mimic natural human interactions. In our experiments, we generate synthetic data using different LLMs and few-shot examples from DialogSum and SAMSum. Pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47%, while a comparison between models fine-tuned on in-domain data and models fine-tuned on synthetic data shows that the synthetic data captures 90.48% of the distribution of the in-domain data. The quality of the generated data also scales with the size of the LLM. These results validate DiaSynth's potential as a robust alternative to traditional data collection methods.
