Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
May 17, 2025
Author: Vincent Koc
cs.AI
Abstract
Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual
smoke-test suite designed to give large-language-model (LLM) pipelines a
unit-test-style safety-net dataset that runs in seconds with minimal cost. It was
born out of the tight feedback-loop demands of building the Comet Opik
prompt-optimization SDK, where waiting on heavyweight benchmarks breaks
developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with
a tiny synthetic-data generator PyPI package built on provider-agnostic
LiteLLM. The generator lets practitioners mint their own tiny packs in any
language, domain, or difficulty, while ten ready-made packs already cover
Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian,
Spanish, and Turkish. Every dataset ships with Croissant metadata and
plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so
teams can drop deterministic micro-benchmarks directly into pull-request gates,
prompt-engineering loops, and production dashboards without touching GPU
budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet
reliably flags prompt-template errors, tokenizer drift, and fine-tuning
side-effects long before full-scale suites like MMLU or BIG-Bench would finish
configuring. The entire framework is released to accelerate continuous,
resource-efficient quality assurance across the generative-AI ecosystem.
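To make the described workflow concrete, below is a minimal sketch of the two steps the abstract outlines: minting a tiny synthetic QA pack through provider-agnostic LiteLLM and running it as a deterministic exact-match smoke test cheap enough for a pull-request gate. This is not the published generator package's API (which the abstract does not show); the function names, prompt, JSON item schema ("question"/"answer"), and accuracy threshold are illustrative assumptions.

```python
# Sketch only: synthesize a tiny QA pack with LiteLLM, then use it as a
# CI smoke test. Prompt, schema, and threshold are assumptions, not the
# TQB++ package's actual interface.
import json
import litellm


def generate_tiny_pack(language: str = "en", topic: str = "general knowledge",
                       num_items: int = 10, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask any LiteLLM-routable model for a small JSON list of QA items."""
    prompt = (
        f"Write {num_items} short factual quiz questions in {language} about {topic}. "
        'Return only a JSON array of objects with keys "question" and "answer".'
    )
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the generated pack as reproducible as possible
    )
    # Naive parse; a real generator would validate and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)


def smoke_test(pack: list[dict], model: str = "gpt-4o-mini") -> float:
    """Exact-match accuracy over the tiny pack; fast enough for a PR gate."""
    hits = 0
    for item in pack:
        reply = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": item["question"]}],
            temperature=0,
        )
        prediction = reply.choices[0].message.content.strip().lower()
        hits += int(item["answer"].strip().lower() in prediction)
    return hits / len(pack)


if __name__ == "__main__":
    pack = generate_tiny_pack(language="fr", topic="geography", num_items=5)
    accuracy = smoke_test(pack)
    # Hypothetical gate: fail the pipeline if the micro-benchmark regresses.
    assert accuracy >= 0.8, f"Smoke test failed: accuracy={accuracy:.2f}"
```

A check like this can run on every pull request in seconds, which is the role the abstract assigns to TQB++ relative to full-scale suites such as MMLU or BIG-Bench.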