Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Abstract

Large language models (LLMs) have shown impressive performance on general-purpose tasks, yet adapting them to specific domains remains challenging because high-quality domain data is scarce. Existing data synthesis tools often struggle to extract reliable fine-tuning data from heterogeneous documents. To address this limitation, we propose Easy Dataset, a unified framework for synthesizing fine-tuning data from unstructured documents via an intuitive graphical user interface (GUI). Specifically, Easy Dataset lets users configure text extraction models and chunking strategies to transform raw documents into coherent text chunks. It then uses a persona-driven prompting approach to generate diverse question-answer pairs with publicly available LLMs. Throughout the pipeline, a human-in-the-loop visual interface supports the review and refinement of intermediate outputs to ensure data quality. Experiments on a financial question-answering task show that fine-tuning LLMs on the synthesized dataset significantly improves domain-specific performance while preserving general knowledge. The source code and installable package are available at https://github.com/ConardLi/easy-dataset and have garnered over 9,000 GitHub stars.
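The two automated stages described in the abstract, chunking raw text and building persona-driven prompts, can be illustrated with a minimal Python sketch. This is not Easy Dataset's implementation: the `Persona` record, the `chunk_text` and `build_prompts` helpers, the character budget, and the prompt wording are all hypothetical, chosen only to show the general shape of such a pipeline. The LLM call and the human review step are omitted.

```python
from dataclasses import dataclass


@dataclass
class Persona:
    """Hypothetical persona record; the fields are illustrative assumptions."""
    name: str
    description: str


def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Greedily pack blank-line-separated paragraphs into chunks.

    Each chunk holds at most max_chars characters, except that a single
    paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in (p.strip() for p in text.split("\n\n")):
        if not paragraph:
            continue  # skip empty paragraphs produced by extra blank lines
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def build_prompts(chunk: str, personas: list[Persona], n_questions: int = 3) -> list[str]:
    """Build one question-generation prompt per persona for a single chunk."""
    return [
        (
            f"You are {p.name}: {p.description}\n"
            f"Read the passage below and write {n_questions} questions this "
            f"persona would ask, each with an answer grounded in the passage.\n\n"
            f"Passage:\n{chunk}"
        )
        for p in personas
    ]


if __name__ == "__main__":
    sample = "First paragraph of a report.\n\nSecond paragraph with more detail."
    personas = [
        Persona("a retail investor", "cares about risk and fees"),
        Persona("a compliance officer", "cares about regulatory obligations"),
    ]
    for chunk in chunk_text(sample, max_chars=40):
        for prompt in build_prompts(chunk, personas):
            print(prompt, end="\n---\n")
```

Each prompt would then be sent to an LLM of the user's choice, and the returned question-answer pairs reviewed before being added to the fine-tuning set. Using several personas per chunk is one simple way to obtain varied questions from the same passage.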
