VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Abstract

Speculative decoding (SD) reduces the high inference cost of LLMs by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision: accept the draft or fully recompute. Yet we find that many rejected tokens can be verified correctly by a slim submodel derived from the full verifier via intra-model routing, without invoking the full verifier. This motivates a slim-verifier that handles tokens requiring moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built around a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks and requires no changes to their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
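
The hierarchical processing of draft tokens can be illustrated with a minimal sketch. The abstract does not state how confidence is measured or where the tier boundaries lie, so everything named here is an assumption made for illustration: the `Tier` enum, the `route_draft_tokens` function, the per-token confidence scores in [0, 1], and the thresholds `tau_high` and `tau_low`. The sketch also routes each token independently and ignores the sequential dependency between draft positions that a real SD loop would have to respect.

```python
from enum import Enum
from typing import List


class Tier(Enum):
    """Hypothetical labels for the three verification tiers."""
    ACCEPT = "accept"          # high confidence: keep the draft token as-is
    SLIM_VERIFIER = "slim"     # medium confidence: regenerate with the slim-verifier
    FULL_VERIFIER = "full"     # uncertain: hand over to the full verifier


def route_draft_tokens(
    confidences: List[float],
    tau_high: float = 0.9,   # assumed threshold, not taken from the paper
    tau_low: float = 0.5,    # assumed threshold, not taken from the paper
) -> List[Tier]:
    """Assign each draft token to a tier based on its confidence score."""
    if not 0.0 <= tau_low <= tau_high <= 1.0:
        raise ValueError("expected 0 <= tau_low <= tau_high <= 1")

    tiers = []
    for c in confidences:
        if c >= tau_high:
            tiers.append(Tier.ACCEPT)
        elif c >= tau_low:
            tiers.append(Tier.SLIM_VERIFIER)
        else:
            tiers.append(Tier.FULL_VERIFIER)
    return tiers


if __name__ == "__main__":
    # Made-up confidence scores for four draft tokens.
    scores = [0.97, 0.72, 0.31, 0.91]
    tiers = route_draft_tokens(scores)
    full_calls = sum(t is Tier.FULL_VERIFIER for t in tiers)
    print([t.value for t in tiers])  # ['accept', 'slim', 'full', 'accept']
    print(full_calls)                # 1
```

In this toy run only one of the four tokens reaches the full verifier. A binary accept-or-recompute scheme using the same `tau_high` as its acceptance cutoff would send two tokens there, which is the kind of saving the multi-tier design aims for.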