DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows
February 16, 2024
Authors: Ajay Patel, Colin Raffel, Chris Callison-Burch
cs.AI
Abstract
Large language models (LLMs) have become a dominant and important tool for
NLP researchers in a wide range of tasks. Today, many researchers use LLMs in
synthetic data generation, task evaluation, fine-tuning, distillation, and
other model-in-the-loop research workflows. However, challenges arise when
using these models that stem from their scale, their closed-source nature, and
the lack of standardized tooling for these new and emerging workflows. The
rapid rise to prominence of these models and these unique challenges have had
immediate adverse impacts on open science and on the reproducibility of work
that uses them. In this paper, we introduce DataDreamer, an open-source Python
library that allows researchers to write simple code to implement powerful LLM
workflows. DataDreamer also helps researchers adhere to best practices that we
propose to encourage open science and reproducibility. The library and
documentation are available at https://github.com/datadreamer-dev/DataDreamer .