DiaSynth -- Synthetic Dialogue Generation Framework
September 25, 2024
Authors: Sathya Krishnan Suresh, Wu Mengjun, Tushar Pranav, Eng Siong Chng
cs.AI
Abstract
The scarcity of domain-specific dialogue datasets across various domains,
from academic topics to everyday conversations, limits the development of
dialogue systems for various applications. Existing research is often
constrained either by dialogue datasets that are too general or by niche domain
dialogue datasets whose scale does not match the required scale for training
dialogue systems. To address this gap, we introduce DiaSynth, a synthetic
dialogue generation framework capable of generating high-quality, contextually
rich dialogues across a wide range of domains. Our approach differs from
existing frameworks by dynamically generating dialogues that incorporate
simulated personas, subtopics, and diverse conversational characteristics,
using a Large Language Model (LLM) with Chain of Thought (CoT) reasoning to
create contextually rich, domain-specific dialogues that closely mimic natural
human interactions. DiaSynth produces tailored dialogues that emulate realistic
conversations. We perform our experiments by generating synthetic data using
different LLMs and few-shot examples from DialogSum and SAMSum. The pretrained
language models fine-tuned on the synthetic data outperform the base models by
16.47%, while the comparison between models fine-tuned on in-domain data and
synthetic data shows that the synthetic data is able to capture 90.48% of the
distribution of the in-domain data. The quality of the data generated also
scales with the size of LLMs. These results validate DiaSynth's potential as a
robust alternative to traditional data collection methods.