DiaSynth -- Synthetic Dialogue Generation Framework

Abstract

The scarcity of domain-specific dialogue datasets, in areas ranging from academic topics to everyday conversation, limits the development of dialogue systems for many applications. Existing research is often constrained either by dialogue datasets that are too general or by niche-domain datasets that are too small for training dialogue systems. To address this gap, we introduce DiaSynth, a synthetic dialogue generation framework capable of generating high-quality, contextually rich dialogues across a wide range of domains. Unlike existing frameworks, DiaSynth generates dialogues dynamically, incorporating simulated personas, subtopics, and diverse conversational characteristics, and it uses a Large Language Model (LLM) with Chain of Thought (CoT) reasoning to create tailored, domain-specific dialogues that closely mimic natural human interactions. We run our experiments by generating synthetic data with different LLMs and few-shot examples from DialogSum and SAMSum. Pretrained language models fine-tuned on the synthetic data outperform the base models by 16.47%, and a comparison between models fine-tuned on in-domain data and models fine-tuned on synthetic data shows that the synthetic data captures 90.48% of the distribution of the in-domain data. The quality of the generated data also scales with the size of the LLM. These results validate DiaSynth's potential as a robust alternative to traditional data collection methods.
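As a rough illustration of the ingredients named in the abstract (topic, subtopic, simulated personas, conversational characteristics, few-shot examples, and a CoT instruction), the following minimal sketch shows how such a prompt might be assembled. The names `DialogueSpec` and `build_prompt`, the prompt wording, and the sample values are all hypothetical; they are not taken from DiaSynth, and no LLM is called here.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class DialogueSpec:
    """Hypothetical container for the inputs the abstract lists."""
    topic: str
    subtopic: str
    personas: List[str]
    characteristics: List[str] = field(default_factory=list)


def build_prompt(spec: DialogueSpec, few_shot_examples: List[str]) -> str:
    """Assemble a CoT-style prompt string; the wording is illustrative only."""
    lines = [
        f"Topic: {spec.topic}",
        f"Subtopic: {spec.subtopic}",
        "Personas:",
    ]
    lines += [f"- {persona}" for persona in spec.personas]
    if spec.characteristics:
        lines.append(
            "Conversational characteristics: " + ", ".join(spec.characteristics)
        )
    # Few-shot examples would come from a dataset such as DialogSum or SAMSum.
    for i, example in enumerate(few_shot_examples, start=1):
        lines.append(f"Example dialogue {i}:\n{example}")
    # The CoT step: ask for reasoning before the dialogue itself.
    lines.append(
        "First reason step by step about how these personas would discuss "
        "the subtopic, then write the dialogue."
    )
    return "\n".join(lines)


# Placeholder values, invented for this sketch.
spec = DialogueSpec(
    topic="healthcare",
    subtopic="telemedicine",
    personas=["a rural nurse", "a software engineer"],
    characteristics=["informal", "short turns"],
)
prompt = build_prompt(spec, ["#Person1#: Hi. #Person2#: Hello."])
print(prompt)
```

The resulting string would then be sent to whichever LLM is being evaluated. Keeping prompt assembly separate from the model call makes it straightforward to swap models, which matters when comparing data quality across LLM sizes.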
