DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows

Abstract

Large language models (LLMs) have become a dominant and important tool for NLP researchers across a wide range of tasks. Today, many researchers use LLMs for synthetic data generation, task evaluation, fine-tuning, distillation, and other model-in-the-loop research workflows. However, using these models raises challenges that stem from their scale, their closed-source nature, and the lack of standardized tooling for these new and emerging workflows. The rapid rise to prominence of these models, together with these unique challenges, has had immediate adverse impacts on open science and on the reproducibility of work that uses them. In this paper, we introduce DataDreamer, an open-source Python library that allows researchers to write simple code to implement powerful LLM workflows. DataDreamer also helps researchers adhere to best practices that we propose to encourage open science and reproducibility. The library and documentation are available at https://github.com/datadreamer-dev/DataDreamer.
