Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Abstract

Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging because high-quality domain data is scarce. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset allows users to configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then uses a persona-driven prompting approach to generate diverse question-answer pairs with publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface supports the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
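
The two pipeline stages named in the abstract, chunking and persona-driven prompting, can be illustrated with a minimal sketch. This is not Easy Dataset's actual implementation or API: the function names `chunk_text` and `build_prompts`, the `Persona` class, the paragraph-packing strategy, and the prompt wording are all assumptions made for illustration, and the call to an LLM is deliberately left out.

```python
from dataclasses import dataclass


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk is closed once adding the next paragraph would exceed max_chars.
    A single paragraph longer than max_chars is kept whole as its own chunk.
    (Hypothetical strategy; the paper's chunking options are not specified here.)
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue  # skip empty paragraphs
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


@dataclass(frozen=True)
class Persona:
    """Hypothetical persona record: who is asking the question."""
    name: str
    description: str


def build_prompts(chunk: str, personas: list[Persona]) -> list[str]:
    """Build one question-answer generation prompt per persona for a chunk."""
    return [
        f"You are {p.name}: {p.description}\n"
        "Write one question this persona would ask about the text below, "
        "then answer it using only that text.\n\n"
        f"Text:\n{chunk}"
        for p in personas
    ]


if __name__ == "__main__":
    document = "Bonds pay fixed coupons.\n\nEquities carry ownership rights."
    personas = [
        Persona("a retail investor", "new to finance and cautious about risk"),
        Persona("a financial auditor", "focused on precise definitions"),
    ]
    for chunk in chunk_text(document, max_chars=40):
        for prompt in build_prompts(chunk, personas):
            print(prompt, end="\n---\n")
```

Each prompt would then be sent to an LLM of the user's choice, and the returned question-answer pairs reviewed by a person before being added to the fine-tuning set. Varying the persona is what produces different questions from the same chunk.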
