SPEED-Bench: A Unified and Diverse Benchmark for Speculative Decoding

Abstract

Speculative Decoding (SD) has emerged as a critical technique for accelerating Large Language Model (LLM) inference. Unlike deterministic system optimizations, SD performance is inherently data-dependent, so diverse and representative workloads are essential for measuring its effectiveness accurately. Existing benchmarks suffer from limited task diversity, inadequate support for throughput-oriented evaluation, and a reliance on high-level implementations that fail to reflect production environments. To address this, we introduce SPEED-Bench, a comprehensive suite designed to standardize SD evaluation across diverse semantic domains and realistic serving regimes. SPEED-Bench offers a carefully curated *Qualitative* data split, selected by prioritizing semantic diversity across the data samples. It also includes a *Throughput* data split, which enables speedup evaluation across a range of concurrencies, from latency-sensitive low-batch settings to throughput-oriented high-load scenarios. By integrating with production engines such as vLLM and TensorRT-LLM, SPEED-Bench lets practitioners analyze system behaviors that other benchmarks often mask. We highlight this by quantifying how synthetic inputs overestimate real-world throughput, identifying batch-size-dependent optimal draft lengths and biases in low-diversity data, and analyzing the caveats of vocabulary pruning in state-of-the-art drafters. We release SPEED-Bench to establish a unified evaluation standard for practical comparisons of SD algorithms.
