Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
May 17, 2025
Author: Vincent Koc
cs.AI
Abstract
Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual
smoke-test suite designed to give large-language-model (LLM) pipelines a
unit-test-style safety-net dataset that runs in seconds with minimal cost. It was
born out of the tight feedback-loop demands of building the Comet Opik
prompt-optimization SDK, where waiting on heavyweight benchmarks breaks
developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with
a tiny synthetic-data generator PyPI package built on provider-agnostic
LiteLLM. The generator lets practitioners mint their own tiny packs in any
language, domain, or difficulty, while ten ready-made packs already cover
Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian,
Spanish, and Turkish. Every dataset ships with Croissant metadata and
plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so
teams can drop deterministic micro-benchmarks directly into pull-request gates,
prompt-engineering loops, and production dashboards without touching GPU
budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet
reliably flags prompt-template errors, tokenizer drift, and fine-tuning
side-effects long before full-scale suites like MMLU or BIG-Bench would finish
configuring. The entire framework is released to accelerate continuous,
resource-efficient quality assurance across the generative-AI ecosystem.
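To make the described workflow concrete, below is a minimal sketch of the two steps the abstract outlines: minting a tiny synthetic QA pack through provider-agnostic LiteLLM and running it as a deterministic exact-match smoke test cheap enough for a pull-request gate. This is not the published generator package's API (which the abstract does not show); the function names, prompt, JSON item schema ("question"/"answer"), and accuracy threshold are illustrative assumptions.

```python
# Sketch only: synthesize a tiny QA pack with LiteLLM, then use it as a
# CI smoke test. Prompt, schema, and threshold are assumptions, not the
# TQB++ package's actual interface.
import json
import litellm


def generate_tiny_pack(language: str = "en", topic: str = "general knowledge",
                       num_items: int = 10, model: str = "gpt-4o-mini") -> list[dict]:
    """Ask any LiteLLM-routable model for a small JSON list of QA items."""
    prompt = (
        f"Write {num_items} short factual quiz questions in {language} about {topic}. "
        'Return only a JSON array of objects with keys "question" and "answer".'
    )
    response = litellm.completion(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # keep the generated pack as reproducible as possible
    )
    # Naive parse; a real generator would validate and retry on malformed JSON.
    return json.loads(response.choices[0].message.content)


def smoke_test(pack: list[dict], model: str = "gpt-4o-mini") -> float:
    """Exact-match accuracy over the tiny pack; fast enough for a PR gate."""
    hits = 0
    for item in pack:
        reply = litellm.completion(
            model=model,
            messages=[{"role": "user", "content": item["question"]}],
            temperature=0,
        )
        prediction = reply.choices[0].message.content.strip().lower()
        hits += int(item["answer"].strip().lower() in prediction)
    return hits / len(pack)


if __name__ == "__main__":
    pack = generate_tiny_pack(language="fr", topic="geography", num_items=5)
    accuracy = smoke_test(pack)
    # Hypothetical gate: fail the pipeline if the micro-benchmark regresses.
    assert accuracy >= 0.8, f"Smoke test failed: accuracy={accuracy:.2f}"
```

A check like this can run on every pull request in seconds, which is the role the abstract assigns to TQB++ relative to full-scale suites such as MMLU or BIG-Bench.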