

SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

February 10, 2026
Authors: Talor Abramovich, Maor Ashkenazi, Carl Putterman, Benjamin Chislett, Tiyasa Mitra, Bita Darvish Rouhani, Ran Zilberstein, Yonatan Geifman
cs.AI

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, meaning that diverse and representative workloads are essential for accurately measuring its effectiveness. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated Qualitative data split, selected by prioritizing semantic diversity across the data samples. Additionally, it includes a Throughput data split, allowing speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines like vLLM and TensorRT-LLM, SPEED-Bench allows practitioners to analyze system behaviors often masked by other benchmarks. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
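To make the data-dependence the abstract emphasizes concrete, the following is a minimal toy sketch of a greedy speculative decoding loop: a drafter proposes a block of candidate tokens, the target accepts the longest agreeing prefix and supplies one correction on mismatch. The `target_next`/`draft_next` callables are hypothetical stand-ins for real models (here, simple next-token functions over integer ids), and this sketch is not SPEED-Bench's implementation; it only illustrates why the same draft length can yield very different speedups on different data.

```python
def speculative_decode(target_next, draft_next, prompt, draft_len, max_new):
    """Greedy speculative decoding sketch.

    Each round, the drafter proposes `draft_len` tokens; the target
    verifies them and the longest agreeing prefix is accepted, plus one
    corrected token on the first mismatch. Returns the generated sequence
    and the number of verification rounds (a proxy for target forward
    passes, since in practice a whole block is verified in one pass).
    """
    out = list(prompt)
    rounds = 0
    while len(out) - len(prompt) < max_new:
        rounds += 1
        # 1) Drafter proposes a block of candidate tokens.
        ctx, draft = list(out), []
        for _ in range(draft_len):
            t = draft_next(ctx)
            draft.append(t)
            ctx.append(t)
        # 2) Target verifies; accept the longest agreeing prefix.
        n_acc, correction = 0, None
        for i in range(draft_len):
            t = target_next(out + draft[:i])
            if t == draft[i]:
                n_acc += 1
            else:
                correction = t  # target's token is always kept
                break
        out.extend(draft[:n_acc])
        if correction is not None:
            out.append(correction)
    return out[:len(prompt) + max_new], rounds


# Toy models over integer token ids: the target always emits last+1 mod 10.
target = lambda ctx: (ctx[-1] + 1) % 10

# A well-aligned drafter accepts every block: 8 tokens in 2 rounds.
aligned_out, aligned_rounds = speculative_decode(target, target, [0], 4, 8)

# A misaligned drafter (always proposes 0) yields 1 token per round.
bad_draft = lambda ctx: 0
bad_out, bad_rounds = speculative_decode(target, bad_draft, [0], 4, 8)
```

Both runs produce the same output sequence, but the aligned drafter finishes in far fewer verification rounds; this gap is exactly the data-dependent quantity that diverse, representative workloads are needed to measure.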