Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

September 16, 2023
作者: Parsa Kavehzadeh, Mojtaba Valipour, Marzieh Tahaei, Ali Ghodsi, Boxing Chen, Mehdi Rezagholizadeh
cs.AI

Abstract

The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique for enabling dynamic inference in deep neural networks. It leverages network modularity to create sub-models with varying computational loads, sorting them by their computation/accuracy characteristics in a nested manner. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining, simply by replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same cost. Our approach boosts model efficiency, eliminating the need for multiple models to cover different scenarios during inference. We show that this approach unlocks the potential of the intermediate layers of transformers for generating the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and the cost of transitioning between different computation/latency budgets. By applying this approach to LLaMA 2 13B, tuning it on the Stanford Alpaca dataset, and comparing it to standard fine-tuning and early exit via the PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding performance.
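
The core mechanism the abstract describes is that SoFT keeps the data and cost of standard supervised fine-tuning but applies the language-modeling loss at several intermediate depths, so each nested sub-model (a prefix of the layer stack, reusing the original LM head) learns to generate the target on its own. The sketch below is a minimal PyTorch illustration of such an objective, assuming a LLaMA-style decoder interface (`embed_tokens`, `layers`, `norm`, `lm_head`); the `soft_loss` name, the chosen exit depths, and the simplified block call (attention masks omitted) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F


def soft_loss(model, input_ids, labels, exit_depths=(12, 16, 20, 24, 28, 32, 36, 40)):
    """Average the next-token loss over a set of nested sub-models (exit depths)."""
    # Assumed LLaMA-style interface: token embeddings, a stack of decoder
    # blocks, a final norm, and a shared LM head. Exit depths here are
    # illustrative values for a 40-layer model.
    hidden = model.embed_tokens(input_ids)
    losses = []
    for depth, block in enumerate(model.layers, start=1):
        hidden = block(hidden)            # run one more decoder block (masks omitted for brevity)
        if depth in exit_depths:          # this prefix of layers is one sorted sub-model
            logits = model.lm_head(model.norm(hidden))
            # Standard causal LM loss: predict token t+1 from position t.
            loss = F.cross_entropy(
                logits[:, :-1].reshape(-1, logits.size(-1)),
                labels[:, 1:].reshape(-1),
                ignore_index=-100,
            )
            losses.append(loss)
    # Every sub-model shares parameters with the full model, so a single
    # backward pass through this averaged loss trains all exit depths at once.
    return torch.stack(losses).mean()
```

At inference time, the same trained weights can then be truncated at any of these depths: decoding from, say, the 20-layer prefix reuses the shared norm and head, which is what lets one switch between latency budgets without storing or loading separate models.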