Tiny QA Benchmark++: Ultra-Lightweight, Synthetic Multilingual Dataset Generation & Smoke-Tests for Continuous LLM Evaluation
May 17, 2025
Author: Vincent Koc
cs.AI
Abstract
Tiny QA Benchmark++ (TQB++) presents an ultra-lightweight, multilingual
smoke-test suite designed to give large-language-model (LLM) pipelines a
unit-test-style safety-net dataset that runs in seconds at minimal cost. It was
born out of the tight feedback-loop demands of building the Comet Opik
prompt-optimization SDK, where waiting on heavyweight benchmarks breaks
developer flow. TQB++ couples a 52-item English gold set (less than 20 kB) with
a tiny synthetic-data generator PyPI package built on the provider-agnostic
LiteLLM. The generator lets practitioners mint their own tiny packs in any
language, domain, or difficulty, while ten ready-made packs already cover
Arabic, Chinese, French, German, Japanese, Korean, Portuguese, Russian,
Spanish, and Turkish. Every dataset ships with Croissant metadata and
plug-and-play files for OpenAI-Evals, LangChain, and standard CI tools, so
teams can drop deterministic micro-benchmarks directly into pull-request gates,
prompt-engineering loops, and production dashboards without touching GPU
budgets. A complete TQB++ run adds only a few seconds to pipeline latency yet
reliably flags prompt-template errors, tokenizer drift, and fine-tuning
side-effects long before full-scale suites like MMLU or BIG-Bench would finish
configuring. The entire framework is released to accelerate continuous,
resource-efficient quality assurance across the generative-AI ecosystem.
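As a rough sketch of how such a deterministic exact-match micro-benchmark could be wired into a pull-request gate, assuming a JSONL gold-set format with `question`/`answer` fields (the `run_smoke_test` helper and the two sample items below are illustrative assumptions, not the actual TQB++ API or dataset):

```python
import json

# Two gold items in the spirit of TQB++'s tiny QA packs (illustrative only,
# not drawn from the real 52-item English gold set).
GOLD_JSONL = """\
{"question": "What is the capital of France?", "answer": "Paris"}
{"question": "How many days are in a week?", "answer": "7"}
"""

def run_smoke_test(model_fn, gold_jsonl: str, threshold: float = 0.9) -> bool:
    """Deterministic exact-match smoke test.

    Returns True (gate passes) if accuracy on the tiny gold set meets the
    threshold, False otherwise -- cheap enough to run on every pull request.
    """
    items = [json.loads(line) for line in gold_jsonl.strip().splitlines()]
    hits = sum(
        1
        for item in items
        if model_fn(item["question"]).strip().lower() == item["answer"].lower()
    )
    return hits / len(items) >= threshold

# In a real pipeline, model_fn would wrap a provider call (e.g. through
# LiteLLM); a stub stands in here so the sketch stays self-contained.
def stub_model(question: str) -> str:
    return {
        "What is the capital of France?": "Paris",
        "How many days are in a week?": "7",
    }[question]

if __name__ == "__main__":
    # A CI script would exit non-zero on failure to block the merge.
    print("gate passed:", run_smoke_test(stub_model, GOLD_JSONL))
```

The exact-match scoring keeps the check fully deterministic, which is what makes it usable as a pull-request gate: a regression in a prompt template or tokenizer shows up as a hard pass/fail rather than a noisy score shift.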