Genie: Achieving Human Parity in Content-Grounded Datasets Generation

Abstract

The lack of high-quality data for content-grounded generation tasks has been identified as a major obstacle to advancing these tasks. To address this gap, we propose Genie, a novel method for automatically generating high-quality content-grounded data. It consists of three stages: (a) content preparation; (b) generation, which creates task-specific examples from the content (e.g., question-answer pairs or summaries); and (c) a filtering mechanism that aims to ensure the quality and faithfulness of the generated data. We showcase this methodology by generating three large-scale synthetic datasets, making wishes, for Long-Form Question-Answering (LFQA), summarization, and information extraction. In a human evaluation, our generated data was found to be natural and of high quality. Furthermore, we compare models trained on our data with models trained on human-written data: ELI5 and ASQA for LFQA, and CNN-DailyMail for summarization. We show that our models are on par with or outperform models trained on human-generated data, and consistently outperform them in faithfulness. Finally, we applied our method to create LFQA data within the medical domain and compared a model trained on it with models trained on other domains.
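The three stages named in the abstract (content preparation, generation, filtering) can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than the authors' implementation: the names `prepare_content`, `generate_examples`, `filter_examples`, `overlap_score`, and `toy_generator` are hypothetical, the generator is a stand-in for whatever model produces the examples, and the word-overlap score is only a toy proxy for a faithfulness check.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Example:
    content: str   # grounding passage
    question: str  # generated task input
    answer: str    # generated task output


def prepare_content(raw_documents: Iterable[str], min_words: int = 5) -> List[str]:
    """Stage (a), assumed rule: split on blank lines and drop very short passages."""
    passages = []
    for doc in raw_documents:
        for para in doc.split("\n\n"):
            para = " ".join(para.split())  # normalize whitespace
            if len(para.split()) >= min_words:
                passages.append(para)
    return passages


def generate_examples(
    passages: Iterable[str],
    generator: Callable[[str], Tuple[str, str]],
) -> List[Example]:
    """Stage (b): ask a generator for one (question, answer) pair per passage."""
    return [Example(p, *generator(p)) for p in passages]


def overlap_score(answer: str, content: str) -> float:
    """Toy faithfulness proxy: fraction of answer words that occur in the content."""
    answer_words = re.findall(r"\w+", answer.lower())
    if not answer_words:
        return 0.0
    content_words = set(re.findall(r"\w+", content.lower()))
    return sum(w in content_words for w in answer_words) / len(answer_words)


def filter_examples(examples: Iterable[Example], threshold: float = 0.8) -> List[Example]:
    """Stage (c): keep only examples whose answer is sufficiently grounded."""
    return [e for e in examples if overlap_score(e.answer, e.content) >= threshold]


def toy_generator(passage: str) -> Tuple[str, str]:
    """Hypothetical stand-in for a model call: answers with the first sentence."""
    first_sentence = passage.split(". ")[0]
    return ("What does the passage say?", first_sentence)


docs = ["Genie builds grounded data from text passages. It filters them.\n\nshort"]
passages = prepare_content(docs)
examples = generate_examples(passages, toy_generator)
# Add one deliberately ungrounded example to show the filter rejecting it.
examples.append(Example(passages[0], "What colour are bananas?", "Bananas are purple"))
kept = filter_examples(examples)
```

In this sketch the filter keeps the grounded example and drops the ungrounded one. A real pipeline would presumably replace `toy_generator` with a language model and the overlap score with a stronger faithfulness and quality check.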
