DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Abstract

Large language models (LLMs) have become a dominant and important tool for NLP researchers in a wide range of tasks. Today, many researchers use LLMs in synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, using these models raises challenges that stem from their scale, their closed-source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models and these unique challenges have had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this paper, we introduce DataDreamer, an open-source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at https://github.com/datadreamer-dev/DataDreamer.
