DiaSynth -- Synthetic Dialogue Generation Framework

September 25, 2024
Authors: Sathya Krishnan Suresh, Wu Mengjun, Tushar Pranav, Eng Siong Chng
cs.AI

Abstract

Domain-specific dialogue datasets are scarce across domains ranging from academic topics to everyday conversations, and this scarcity limits the development of dialogue systems. Existing research is often constrained either by dialogue datasets that are too general or by niche-domain datasets that are too small to train dialogue systems. To address this gap, we introduce DiaSynth, a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Unlike existing frameworks, DiaSynth generates dialogues dynamically, incorporating simulated personas, subtopics, and diverse conversational characteristics, and uses a Large Language Model (LLM) with Chain of Thought (CoT) reasoning to create tailored, domain-specific dialogues that closely mimic natural human interactions. We run our experiments by generating synthetic data with different LLMs and few-shot examples from DialogSum and SAMSum. Pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47%, and a comparison between models fine-tuned on in-domain data and on synthetic data shows that the synthetic data captures 90.48% of the distribution of the in-domain data. The quality of the generated data also scales with the size of the LLM. These results validate DiaSynth's potential as a robust alternative to traditional data collection methods.
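To make the generation recipe in the abstract more concrete, here is a minimal Python sketch of how a prompt might be assembled from a domain, a subtopic, simulated personas, conversational characteristics, and optional few-shot examples, ending with a chain-of-thought instruction. This is not the authors' implementation: the `Persona` class, the function names, and the prompt wording are all hypothetical, and no LLM is called. The sketch only builds the text that would be sent to one.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Persona:
    """A simulated speaker (hypothetical structure, not taken from the paper)."""
    name: str
    description: str


def build_dialogue_prompt(topic, subtopic, personas, characteristics,
                          few_shot_examples=()):
    """Assemble one chain-of-thought prompt for a single synthetic dialogue.

    The template wording is an assumption made for illustration only.
    """
    lines = [f"Domain: {topic}", f"Subtopic: {subtopic}", "Participants:"]
    lines += [f"- {p.name}: {p.description}" for p in personas]
    lines.append("Conversational characteristics: " + ", ".join(characteristics))
    # Optional few-shot examples, e.g. dialogues sampled from an existing corpus.
    for i, example in enumerate(few_shot_examples, start=1):
        lines.append(f"Example dialogue {i}:\n{example}")
    # Chain-of-thought instruction: ask for reasoning before the dialogue itself.
    lines.append(
        "First reason step by step about what each participant wants and how "
        "the conversation should unfold, then write the dialogue."
    )
    return "\n".join(lines)


def enumerate_prompts(topic, subtopics, persona_groups, characteristics):
    """Build one prompt per (subtopic, persona group) combination."""
    return [
        build_dialogue_prompt(topic, subtopic, group, characteristics)
        for subtopic, group in product(subtopics, persona_groups)
    ]
```

Each returned string would then be passed to whichever LLM is being used. Crossing subtopics with persona groups is one simple way to get varied dialogues within a single domain.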

