

Genie: Achieving Human Parity in Content-Grounded Datasets Generation

January 25, 2024
Authors: Asaf Yehudai, Boaz Carmeli, Yosi Mass, Ofir Arviv, Nathaniel Mills, Assaf Toledo, Eyal Shnarch, Leshem Choshen
cs.AI

Abstract

The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) Content Preparation; (b) Generation, creating task-specific examples from the content (e.g., question-answer pairs or summaries); and (c) Filtering, a mechanism aimed at ensuring the quality and faithfulness of the generated data. We showcase this methodology by generating three large-scale synthetic datasets, making wishes, for Long-Form Question-Answering (LFQA), summarization, and information extraction. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data -- ELI5 and ASQA for LFQA, and CNN-DailyMail for summarization. We show that our models are on par with or outperform models trained on human-generated data, and consistently outperform them in faithfulness. Finally, we apply our method to create LFQA data within the medical domain and compare a model trained on it with models trained on other domains.
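
To make the three-stage pipeline described in the abstract concrete, here is a minimal sketch in Python. It is an illustrative assumption of how Content Preparation, Generation, and Filtering could fit together, not the authors' actual implementation: the function names (prepare_content, generate_examples, filter_examples), the prompt text, the 50-word passage heuristic, and the 0.8 faithfulness threshold are all hypothetical.

```python
# Hypothetical sketch of the three-stage pipeline: (a) Content Preparation,
# (b) Generation, (c) Filtering. All names, prompts, and thresholds below
# are illustrative assumptions, not the paper's implementation.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    content: str   # grounding passage
    question: str  # generated question (LFQA setting)
    answer: str    # generated answer


def prepare_content(raw_documents: List[str]) -> List[str]:
    """Stage (a): split raw documents into clean, self-contained passages."""
    passages = []
    for doc in raw_documents:
        for para in doc.split("\n\n"):
            para = para.strip()
            if len(para.split()) >= 50:  # assumed minimum-length heuristic
                passages.append(para)
    return passages


def generate_examples(passages: List[str],
                      llm: Callable[[str], str]) -> List[Example]:
    """Stage (b): prompt an LLM to create task-specific examples from content."""
    examples = []
    for passage in passages:
        prompt = (
            "Based only on the passage below, write a question and a "
            "faithful, self-contained answer.\n\nPassage:\n" + passage
        )
        output = llm(prompt)  # assumed output format: "Q: ...\nA: ..."
        if "\nA:" in output:
            q, a = output.split("\nA:", 1)
            examples.append(
                Example(passage, q.removeprefix("Q:").strip(), a.strip())
            )
    return examples


def filter_examples(examples: List[Example],
                    faithfulness: Callable[[str, str], float],
                    threshold: float = 0.8) -> List[Example]:
    """Stage (c): keep examples whose answers score as grounded in the content."""
    return [ex for ex in examples
            if faithfulness(ex.content, ex.answer) >= threshold]
```

In this sketch, `llm` and `faithfulness` are placeholders for any generator and any faithfulness scorer; the paper's contribution is the overall recipe of generating from grounding content and then filtering for quality and faithfulness, which the three functions mirror stage by stage.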