Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources
September 12, 2024
Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
cs.AI
Abstract
Large Language Models still struggle in challenging scenarios that leverage
structured data, complex reasoning, or tool usage. In this paper, we propose
Source2Synth: a new method that can be used for teaching LLMs new skills
without relying on costly human annotations. Source2Synth takes as input a
custom data source and produces synthetic data points with intermediate
reasoning steps grounded in real-world sources. Source2Synth improves the
dataset quality by discarding low-quality generations based on their
answerability. We demonstrate the generality of this approach by applying it to
two challenging domains: we test reasoning abilities in multi-hop question
answering (MHQA), and tool usage in tabular question answering (TQA). Our
method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on
HotPotQA compared to the fine-tuned baselines.
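For intuition, the generation-and-curation loop described in the abstract can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the prompts, the generate placeholder for an LLM call, and the exact-match answerability filter are assumptions standing in for the paper's actual components.

    # Minimal sketch of a Source2Synth-style pipeline (illustrative only):
    # generate a question with intermediate reasoning grounded in a real source,
    # then keep it only if it passes an answerability check.
    # `generate` stands in for any LLM completion call; prompts and the
    # exact-match check are assumptions, not the paper's exact setup.
    from dataclasses import dataclass

    @dataclass
    class SyntheticExample:
        source: str     # real-world grounding (e.g., a Wikipedia passage or table)
        question: str   # question generated from the source
        reasoning: str  # intermediate reasoning steps
        answer: str     # final answer

    def generate(prompt: str) -> str:
        """Placeholder for a call to an LLM (API client, local model, etc.)."""
        raise NotImplementedError

    def synthesize(source: str) -> SyntheticExample:
        """Produce one candidate data point grounded in a real source."""
        question = generate(f"Write a question answerable only from this source:\n{source}")
        reasoning = generate(
            f"Source:\n{source}\nQuestion: {question}\nExplain the reasoning step by step."
        )
        answer = generate(
            f"Source:\n{source}\nQuestion: {question}\nReasoning: {reasoning}\nFinal answer:"
        )
        return SyntheticExample(source, question, reasoning, answer)

    def is_answerable(example: SyntheticExample) -> bool:
        """Curation step: discard a candidate unless the question can be
        re-answered from the source and the prediction matches the stored answer."""
        predicted = generate(
            f"Source:\n{example.source}\nQuestion: {example.question}\nAnswer:"
        )
        return predicted.strip().lower() == example.answer.strip().lower()

    def build_dataset(sources: list[str]) -> list[SyntheticExample]:
        candidates = [synthesize(s) for s in sources]
        return [ex for ex in candidates if is_answerable(ex)]  # drop low-quality generations

The abstract instantiates this idea in two domains: reasoning for multi-hop question answering on HotPotQA, and tool usage for tabular question answering on WikiSQL.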