VIA-SD: Verification via Intra-Model Routing for Speculative Decoding

Abstract

Speculative decoding (SD) reduces the high inference cost of large language models (LLMs) by having a lightweight drafter generate candidate tokens that a large verifier validates in parallel. Existing draft-verify methods make a binary decision: accept the draft token or fully recompute it. Yet we find that many rejected tokens can be verified correctly by a slim submodel, derived from the full verifier via intra-model routing, without invoking the full verifier. This motivates a slim-verifier that handles tokens requiring only moderate verification resources, reducing expensive large-model calls. We propose Verification via Intra-Model Routing for Speculative Decoding (VIA-SD), a multi-tier framework built on a routed slim-verifier. Draft tokens are processed hierarchically: direct acceptance for high-confidence cases, slim-verifier regeneration for medium-confidence cases, and full-model verification for uncertain cases. Across four representative tasks and multiple model families, VIA-SD reduces rejection rates by 0.10-0.22 and delivers 10-20% speedups over strong SD baselines, while achieving 2.5-3x acceleration over non-drafting decoding. Moreover, VIA-SD is compatible with existing SD frameworks and does not require modifying their training procedures. Our results suggest multi-tier SD as a general paradigm for scalable and efficient LLM inference. Project page: https://zju-xyc.github.io/VIA-SD-Project-Page/
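The hierarchical processing of draft tokens can be illustrated with a minimal sketch. This is not the authors' implementation: the names `Tier`, `route`, and `process_draft`, the thresholds 0.9 and 0.5, and the use of a single scalar confidence per token are all assumptions made for illustration. The abstract does not specify how confidence is computed or where the tier boundaries lie. The sketch also treats each token independently and ignores the fact that, in a real SD loop, regenerating one token may invalidate the draft tokens after it.

```python
from enum import Enum
from typing import Callable, List, Sequence, Tuple


class Tier(Enum):
    ACCEPT = "accept"        # high confidence: keep the draft token unchanged
    SLIM = "slim_verifier"   # medium confidence: regenerate with the slim submodel
    FULL = "full_verifier"   # low confidence: fall back to the full verifier


def route(confidence: float, accept_at: float = 0.9, slim_at: float = 0.5) -> Tier:
    """Map a draft-token confidence in [0, 1] to a verification tier.

    The thresholds are hypothetical placeholders, not values from the paper.
    """
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    if confidence >= accept_at:
        return Tier.ACCEPT
    if confidence >= slim_at:
        return Tier.SLIM
    return Tier.FULL


def process_draft(
    draft: Sequence[Tuple[str, float]],
    slim_verify: Callable[[str], str],
    full_verify: Callable[[str], str],
) -> Tuple[List[str], List[Tier]]:
    """Send each (token, confidence) pair to its tier and collect the outputs.

    slim_verify and full_verify are stand-ins for the two verifiers; each
    takes a draft token and returns the token to emit.
    """
    tokens: List[str] = []
    tiers: List[Tier] = []
    for token, confidence in draft:
        tier = route(confidence)
        if tier is Tier.ACCEPT:
            tokens.append(token)
        elif tier is Tier.SLIM:
            tokens.append(slim_verify(token))
        else:
            tokens.append(full_verify(token))
        tiers.append(tier)
    return tokens, tiers


if __name__ == "__main__":
    # Toy verifiers that only mark which path handled the token.
    out, used = process_draft(
        [("the", 0.97), ("cat", 0.62), ("zat", 0.11)],
        slim_verify=lambda t: f"{t}<slim>",
        full_verify=lambda t: f"{t}<full>",
    )
    print(out)
    print([t.value for t in used])
```

In this toy setup, the expensive `full_verify` path is called only for tokens below the lower threshold, which is the mechanism the abstract credits for reducing large-model calls.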