Direct Multi-Token Decoding
October 13, 2025
Authors: Xuan Luo, Weizhi Wang, Xifeng Yan
cs.AI
Abstract
Decoder-only transformers have become the standard architecture for large
language models (LLMs) due to their strong performance. Recent studies suggest
that, in pre-trained LLMs, early, middle, and late layers may serve distinct
roles: Early layers focus on understanding the input context, middle layers
handle task-specific processing, and late layers convert abstract
representations into output tokens. We hypothesize that once representations
have been processed by the early and middle layers, the resulting hidden states
may encapsulate sufficient information to support the generation of multiple
tokens using only the late layers, eliminating the need to repeatedly traverse
the early and middle layers. We refer to this inference paradigm as Direct
Multi-Token Decoding (DMTD). Unlike speculative decoding, our method introduces
no additional parameters, auxiliary routines, or post-generation verification.
Despite being trained on a limited dataset, a fine-tuned DMTD Qwen3-4B model
has already demonstrated promising results, achieving up to a 2x speedup with
only minor performance loss. Moreover, as shown in our scaling analysis, its
performance is expected to further improve with larger training datasets.
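To make the decoding paradigm concrete, below is a minimal, self-contained PyTorch sketch of how a DMTD-style inference loop might be organized: one full pass through all layers produces a hidden state at a split point, and the next few tokens are then decoded using only the late layers before another full pass is run. The split index, the toy stand-in blocks, the number of tokens per cycle, and the rule for conditioning the late layers on each newly generated token are all illustrative assumptions of mine; the abstract does not specify the paper's actual mechanism or layer partition.

```python
# Conceptual sketch of a DMTD-style inference loop (not the authors' code).
import torch
import torch.nn as nn

class ToyDecoder(nn.Module):
    def __init__(self, vocab=32000, d=256, n_layers=8, split_idx=6):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        # Stand-in blocks; a real model would use causal self-attention layers.
        self.layers = nn.ModuleList(
            nn.Sequential(nn.Linear(d, d), nn.GELU()) for _ in range(n_layers)
        )
        self.lm_head = nn.Linear(d, vocab)
        self.split_idx = split_idx  # assumed boundary between early/middle and late layers

    def early_middle(self, h):
        for layer in self.layers[:self.split_idx]:
            h = layer(h)
        return h

    def late(self, h):
        for layer in self.layers[self.split_idx:]:
            h = layer(h)
        return self.lm_head(h)

@torch.no_grad()
def dmtd_generate(model, prompt_ids, max_new=16, tokens_per_cycle=4):
    ids = prompt_ids
    while ids.shape[1] - prompt_ids.shape[1] < max_new:
        # One full pass through early and middle layers; keep the last position's state.
        h_split = model.early_middle(model.embed(ids))[:, -1:]
        # Decode several tokens using only the late layers.
        for _ in range(tokens_per_cycle):
            logits = model.late(h_split)
            next_id = logits.argmax(-1)          # greedy decoding for simplicity
            ids = torch.cat([ids, next_id], dim=1)
            # Placeholder conditioning rule (my assumption): fold the new token's
            # embedding into the cached split-point state for the next late-only step.
            h_split = h_split + model.embed(next_id)
    return ids

out = dmtd_generate(ToyDecoder(), torch.randint(0, 32000, (1, 5)))
```

In this sketch the speedup comes from running the early and middle layers once per cycle rather than once per token; with a split at layer 6 of 8 and 4 tokens per cycle, each cycle touches roughly (6 + 4 x 2) layer applications instead of 4 x 8, which is where a saving of up to about 2x could plausibly arise. Unlike speculative decoding, no draft model or verification pass appears anywhere in the loop.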