

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

February 10, 2026
Authors: Talor Abramovich, Maor Ashkenazi, Carl Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman
cs.AI

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
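The abstract's point that SD performance is data-dependent and that the optimal draft length varies can be illustrated with the standard analytical cost model for speculative decoding (per Leviathan et al., 2023). The sketch below is illustrative only and not part of SPEED-Bench itself; the acceptance rate `alpha` and draft-to-target cost ratio `c` are assumed example values, which in practice depend on the workload and serving regime:

```python
def expected_tokens_per_step(alpha: float, gamma: int) -> float:
    """Expected tokens emitted per verification step with draft length gamma.

    Assumes an i.i.d. per-token acceptance rate alpha (a simplification);
    the expectation is the geometric sum 1 + alpha + ... + alpha**gamma.
    """
    if alpha >= 1.0:
        return gamma + 1.0
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


def modeled_speedup(alpha: float, gamma: int, c: float) -> float:
    """Modeled speedup over plain autoregressive decoding.

    c is the draft model's per-token cost relative to the target model;
    each step pays gamma draft tokens plus one target verification pass.
    This ignores batching effects, which is exactly why a throughput-
    oriented benchmark like SPEED-Bench measures speedup empirically.
    """
    return expected_tokens_per_step(alpha, gamma) / (gamma * c + 1.0)


# Example: with a high acceptance rate the model favors longer drafts,
# but the optimum shifts as alpha (i.e., the data distribution) changes.
best_gamma = max(range(1, 13), key=lambda g: modeled_speedup(0.8, g, 0.05))
```

Because real acceptance rates differ across semantic domains and batch sizes change the effective verification cost, the analytically "optimal" draft length rarely matches the one that maximizes measured throughput — the gap the benchmark is designed to expose.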