
Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

September 12, 2024
Authors: Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli
cs.AI

Abstract

Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
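As a rough illustration of the pipeline the abstract describes (generate grounded synthetic examples with intermediate reasoning, then curate by answerability), here is a minimal Python sketch. Every name in it (call_llm, generate_example, is_answerable, build_dataset, SyntheticExample) is a hypothetical placeholder rather than part of any released Source2Synth code, and the answerability check is simplified to an exact string match.

```python
# Sketch of a Source2Synth-style generate-then-curate loop.
# All names below are illustrative placeholders, not an official API.
from dataclasses import dataclass

@dataclass
class SyntheticExample:
    source: str      # real-world source the example is grounded in
    question: str
    reasoning: str   # intermediate reasoning steps
    answer: str

def call_llm(prompt: str) -> str:
    """Placeholder for a call to whatever LLM backend is available."""
    raise NotImplementedError

def generate_example(source: str) -> SyntheticExample:
    """Generate a question, intermediate reasoning, and answer grounded in `source`."""
    question = call_llm(f"Write a question answerable from this source:\n{source}")
    reasoning = call_llm(f"Source:\n{source}\nQuestion: {question}\nThink step by step.")
    answer = call_llm(f"Source:\n{source}\nQuestion: {question}\n"
                      f"Reasoning: {reasoning}\nFinal answer:")
    return SyntheticExample(source, question, reasoning, answer)

def is_answerable(example: SyntheticExample) -> bool:
    """Curation step: keep an example only if the question can be re-answered from the source."""
    predicted = call_llm(f"Source:\n{example.source}\nQuestion: {example.question}\nAnswer:")
    return predicted.strip().lower() == example.answer.strip().lower()

def build_dataset(sources: list[str]) -> list[SyntheticExample]:
    """Generate candidates from each source, then discard low-quality generations."""
    candidates = [generate_example(s) for s in sources]
    return [ex for ex in candidates if is_answerable(ex)]
```

The final filter in build_dataset mirrors the curation step described in the abstract: candidate generations whose questions cannot be answered from their own source are discarded before fine-tuning.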
