Unitxt：靈活、可共享和可重複使用的數據準備和評估用於生成式人工智能

摘要

在生成式自然語言處理領域的動態格局中，傳統的文本處理流程限制了研究的靈活性和可重現性，因為它們是針對特定數據集、任務和模型組合而設計的。不斷升級的複雜性涉及系統提示、特定於模型的格式、指令等，呼籲轉向結構化、模塊化和可定制的解決方案。為了滿足這一需求，我們提出了 Unitxt，這是一個創新的庫，專為生成式語言模型量身定制的文本數據準備和評估而設計。Unitxt 與 HuggingFace 和 LM-eval-harness 等常見庫進行本地集成，並將處理流程拆分為模塊化組件，從而實現了易於定制和共享。這些組件涵蓋了特定於模型的格式、任務提示以及許多其他全面的數據集處理定義。Unitxt-Catalog 將這些組件集中在一起，促進了現代文本數據流程中的協作和探索。Unitxt 不僅僅是一個工具，更是一個社區驅動的平台，讓用戶可以共同構建、共享和推進他們的流程。加入 Unitxt 社區，請訪問 https://github.com/IBM/unitxt！

English

In the dynamic landscape of generative NLP, traditional text processing pipelines limit research flexibility and reproducibility, as they are tailored to specific dataset, task, and model combinations. The escalating complexity, involving system prompts, model-specific formats, instructions, and more, calls for a shift to a structured, modular, and customizable solution. Addressing this need, we present Unitxt, an innovative library for customizable textual data preparation and evaluation tailored to generative language models. Unitxt natively integrates with common libraries like HuggingFace and LM-eval-harness and deconstructs processing flows into modular components, enabling easy customization and sharing between practitioners. These components encompass model-specific formats, task prompts, and many other comprehensive dataset processing definitions. The Unitxt-Catalog centralizes these components, fostering collaboration and exploration in modern textual data workflows. Beyond being a tool, Unitxt is a community-driven platform, empowering users to build, share, and advance their pipelines collaboratively. Join the Unitxt community at https://github.com/IBM/unitxt!

Unitxt：靈活、可共享和可重複使用的數據準備和評估用於生成式人工智能

Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative AI

摘要

Support