Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents
July 5, 2025
Authors: Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang
cs.AI
Abstract
Large language models (LLMs) have shown impressive performance on
general-purpose tasks, yet adapting them to specific domains remains
challenging due to the scarcity of high-quality domain data. Existing data
synthesis tools often struggle to extract reliable fine-tuning data from
heterogeneous documents. To address this limitation, we propose
Easy Dataset, a unified framework for synthesizing fine-tuning data from
unstructured documents via an intuitive graphical user interface (GUI).
Specifically, Easy Dataset allows users to easily configure text extraction
models and chunking strategies to transform raw documents into coherent text
chunks. It then leverages a persona-driven prompting approach to generate
diverse question-answer pairs using publicly available LLMs. Throughout the
pipeline, a human-in-the-loop visual interface facilitates the review and
refinement of intermediate outputs to ensure data quality. Experiments on a
financial question-answering task show that fine-tuning LLMs on the synthesized
dataset significantly improves domain-specific performance while preserving
general knowledge. The source code and installable package are available at
https://github.com/ConardLi/easy-dataset, and the repository has garnered over
9,000 GitHub stars.
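
The two automated stages described in the abstract, chunking raw text and persona-driven prompting, can be illustrated with a minimal Python sketch. This is not Easy Dataset's implementation: the `Persona` class, the `chunk_text`, `build_prompt`, and `build_prompts` helpers, the paragraph-packing rule, and the prompt wording are all assumptions made for illustration. The sketch stops at prompt construction; sending each prompt to an LLM and reviewing the returned question-answer pairs is left out.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    # Hypothetical persona record; the fields are illustrative only.
    name: str
    description: str


def chunk_text(text: str, max_chars: int = 200) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    A chunk stays within max_chars unless a single paragraph is longer
    than that, in which case the paragraph becomes its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if current and len(candidate) > max_chars:
            # Adding this paragraph would overflow: close the current chunk.
            chunks.append(current)
            current = para
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def build_prompt(chunk: str, persona: Persona) -> str:
    """Compose one question-answer generation prompt (wording is assumed)."""
    return (
        f"You are {persona.name}: {persona.description}\n"
        "Write one question this person would ask about the text below, "
        "then answer it using only that text.\n\n"
        f"Text:\n{chunk}"
    )


def build_prompts(text: str, personas: list[Persona], max_chars: int = 200) -> list[str]:
    """Pair every chunk with every persona to diversify the questions."""
    return [
        build_prompt(chunk, persona)
        for chunk in chunk_text(text, max_chars)
        for persona in personas
    ]
```

Pairing each chunk with several personas is one simple way to obtain varied questions from the same passage; in a human-in-the-loop setting, the chunks and the generated pairs would each be reviewed before the dataset is exported.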