
Return of the Encoder: Maximizing Parameter Efficiency for SLMs

January 27, 2025
作者: Mohamed Elfeki, Rui Liu, Chad Voegele
cs.AI

Abstract

The dominance of large decoder-only language models has overshadowed encoder-decoder architectures, despite their fundamental efficiency advantages in sequence processing. For small language models (SLMs) - those with 1 billion parameters or fewer - our systematic analysis across GPU, CPU, and NPU platforms reveals that encoder-decoder architectures achieve 47% lower first-token latency and 4.7x higher throughput than decoder-only models on edge devices. These gains may be attributed to the encoder-decoder's one-time input processing and its clean separation of the understanding and generation phases. We introduce a novel knowledge distillation framework that enables encoder-decoder models to leverage capabilities from large, scalable decoder-only teachers while preserving their architectural advantages, achieving an improvement of up to 6 average performance points across diverse tasks, with significant gains on asymmetric sequence tasks where input and output distributions benefit from different processing approaches. When combined with modern advances such as Rotary Positional Embeddings (RoPE) and vision encoders, our systematic investigation demonstrates that encoder-decoder architectures provide a more practical path toward deploying capable language models in resource-constrained environments. Our findings challenge the prevailing trend toward decoder-only scaling, showing that architectural choices become increasingly crucial as parameter budgets decrease, particularly for on-device and edge deployments where computational efficiency is paramount.
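The abstract describes distilling a large decoder-only teacher into an encoder-decoder student but does not spell out the objective. A common choice for this kind of cross-architecture distillation is a temperature-softened KL divergence between teacher and student next-token distributions (the classic Hinton-style formulation). The sketch below illustrates that generic loss only; all names are hypothetical and it is not necessarily the paper's exact framework:

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over the last axis (vocabulary)."""
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2
    so gradients keep a comparable magnitude as T varies.

    Shapes: (num_positions, vocab_size). In the paper's setting the
    teacher would be a decoder-only model and the student the
    encoder-decoder's decoder head, aligned on the same target tokens.
    """
    p = softmax(teacher_logits, temperature)           # soft targets
    log_p = np.log(p + 1e-12)
    log_q = np.log(softmax(student_logits, temperature) + 1e-12)
    kl_per_position = (p * (log_p - log_q)).sum(axis=-1)
    return (temperature ** 2) * kl_per_position.mean()
```

In practice this term is typically mixed with the ordinary cross-entropy on ground-truth tokens; the loss is zero when student and teacher logits agree and grows as their predictive distributions diverge.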

