PROMPTEVALS: A Dataset of Assertions and Guardrails for Custom Production Large Language Model Pipelines

April 20, 2025
Authors: Reya Vir, Shreya Shankar, Harrison Chase, Will Fu-Hinthorn, Aditya Parameswaran
cs.AI

Abstract

Large language models (LLMs) are increasingly deployed in specialized production data processing pipelines across diverse domains -- such as finance, marketing, and e-commerce. However, when running them in production across many inputs, they often fail to follow instructions or meet developer expectations. To improve reliability in these applications, creating assertions or guardrails for LLM outputs to run alongside the pipelines is essential. Yet, determining the right set of assertions that capture developer requirements for a task is challenging. In this paper, we introduce PROMPTEVALS, a dataset of 2087 LLM pipeline prompts with 12623 corresponding assertion criteria, sourced from developers using our open-source LLM pipeline tools. This dataset is 5x larger than previous collections. Using a hold-out test split of PROMPTEVALS as a benchmark, we evaluated closed- and open-source models in generating relevant assertions. Notably, our fine-tuned Mistral and Llama 3 models outperform GPT-4o by 20.93% on average, offering both reduced latency and improved performance. We believe our dataset can spur further research in LLM reliability, alignment, and prompt engineering.
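
The abstract's central idea, assertions or guardrails that run alongside an LLM pipeline and check each output against developer expectations, can be illustrated with a minimal sketch. The guardrail names and criteria below (assert_valid_json, assert_no_apology) are hypothetical examples for illustration only; they are not the paper's implementation and not drawn from the PROMPTEVALS assertion criteria.

```python
import json

# Minimal sketch of pipeline-side assertions (hypothetical, not the paper's
# code): each guardrail checks one developer expectation against a raw LLM
# output and returns (passed, reason).

def assert_valid_json(output: str):
    """Guardrail: the prompt asked for a JSON object, so the output must parse."""
    try:
        json.loads(output)
        return True, "output parses as JSON"
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc}"

def assert_no_apology(output: str):
    """Guardrail: the output should not contain boilerplate apologies/refusals."""
    banned = ("i'm sorry", "as an ai")
    hits = [phrase for phrase in banned if phrase in output.lower()]
    return (not hits), (f"banned phrases found: {hits}" if hits else "clean")

def run_guardrails(output: str, guardrails):
    """Run every assertion alongside the pipeline and collect any failures."""
    failures = [reason
                for guardrail in guardrails
                for ok, reason in [guardrail(output)]
                if not ok]
    return len(failures) == 0, failures

if __name__ == "__main__":
    llm_output = '{"sentiment": "positive", "score": 0.92}'
    ok, failures = run_guardrails(llm_output,
                                  [assert_valid_json, assert_no_apology])
    print("passed" if ok else f"failed: {failures}")
```

In this framing, the paper's task is generating the right set of such assertion criteria from a pipeline prompt, which is what the fine-tuned Mistral and Llama 3 models are benchmarked on.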
