Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources

Abstract

Large Language Models still struggle in challenging scenarios that leverage structured data, complex reasoning, or tool usage. In this paper, we propose Source2Synth: a new method that can be used for teaching LLMs new skills without relying on costly human annotations. Source2Synth takes as input a custom data source and produces synthetic data points with intermediate reasoning steps grounded in real-world sources. Source2Synth improves the dataset quality by discarding low-quality generations based on their answerability. We demonstrate the generality of this approach by applying it to two challenging domains: we test reasoning abilities in multi-hop question answering (MHQA), and tool usage in tabular question answering (TQA). Our method improves performance by 25.51% for TQA on WikiSQL and 22.57% for MHQA on HotPotQA compared to the fine-tuned baselines.
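The answerability-based curation step can be illustrated with a minimal sketch. This is not the authors' implementation: the `SyntheticExample` fields, the `answerer` callable (a stand-in for whatever model attempts the question), the exact-match comparison after normalization, and the `max_attempts` retry budget are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class SyntheticExample:
    """One generated data point (hypothetical schema, not from the paper)."""
    question: str
    reasoning_steps: Tuple[str, ...]
    answer: str
    source: str  # the real-world text or table the example is grounded in


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(text.lower().split())


def is_answerable(
    example: SyntheticExample,
    answerer: Callable[[str, str], str],
    max_attempts: int = 3,
) -> bool:
    """Return True if the answerer reproduces the reference answer within the budget.

    `answerer(question, source)` is an assumed interface; in practice it would
    wrap a (possibly sampled) model call.
    """
    for _ in range(max_attempts):
        prediction = answerer(example.question, example.source)
        if normalize(prediction) == normalize(example.answer):
            return True
    return False


def curate(
    examples: Iterable[SyntheticExample],
    answerer: Callable[[str, str], str],
    max_attempts: int = 3,
) -> List[SyntheticExample]:
    """Keep only the examples judged answerable; discard the rest."""
    return [ex for ex in examples if is_answerable(ex, answerer, max_attempts)]
```

The filter is deliberately strict: an example is kept only if its reference answer can be reproduced from the question and its source. A real pipeline would likely need a softer match than exact string equality.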
