ChatPaper.aiChatPaper

Unitxt:灵活、可共享和可重复使用的数据准备和评估 用于生成式人工智能

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

January 25, 2024
作者: Elron Bandel, Yotam Perlitz, Elad Venezian, Roni Friedman-Melamed, Ofir Arviv, Matan Orbach, Shachar Don-Yehyia, Dafna Sheinwald, Ariel Gera, Leshem Choshen, Michal Shmueli-Scheuer, Yoav Katz
cs.AI

摘要

在生成式自然语言处理(NLP)的动态领域中,传统的文本处理流程限制了研究的灵活性和可重现性,因为它们针对特定的数据集、任务和模型组合进行了定制。随着系统提示、模型特定格式、指令等日益复杂,需要转向结构化、模块化和可定制的解决方案。为了满足这一需求,我们推出了Unitxt,这是一个创新的库,专门用于定制生成式语言模型的文本数据准备和评估。Unitxt与HuggingFace和LM-eval-harness等常用库进行了本地集成,并将处理流程拆分为模块化组件,实现了从业者之间的轻松定制和共享。这些组件涵盖了模型特定格式、任务提示以及许多其他全面的数据集处理定义。Unitxt-Catalog集中了这些组件,促进了现代文本数据工作流中的协作和探索。Unitxt不仅是一个工具,还是一个社区驱动的平台,赋予用户共同构建、共享和推进流程的能力。加入Unitxt社区,访问https://github.com/IBM/unitxt!
English
In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!
PDF241December 15, 2024