Sorted LLaMA: Unlocking the Potential of Intermediate Layers of Large Language Models for Dynamic Inference Using Sorted Fine-Tuning (SoFT)

Abstract

The rapid advancement of large language models (LLMs) has revolutionized natural language processing (NLP). While these models excel at understanding and generating human-like text, their widespread deployment can be prohibitively expensive. SortedNet is a recent training technique that enables dynamic inference for deep neural networks. It leverages network modularity to create sub-models with varying computational loads and sorts them in a nested manner according to their computation/accuracy characteristics. We extend SortedNet to generative NLP tasks, making large language models dynamic without any pretraining, simply by replacing standard Supervised Fine-Tuning (SFT) with Sorted Fine-Tuning (SoFT) at the same cost. Our approach boosts model efficiency and eliminates the need for multiple models for different scenarios during inference. We show that this approach unlocks the potential of the intermediate layers of transformers to generate the target output. Our sub-models remain integral components of the original model, minimizing storage requirements and transition costs between different computational/latency budgets. By applying this approach to LLaMA 2 13B, tuning on the Stanford Alpaca dataset, and comparing it to normal tuning and early exit on the PandaLM benchmark, we show that Sorted Fine-Tuning can deliver models twice as fast as the original model while maintaining or exceeding its performance.
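
The nesting idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the layers are toy elementwise functions standing in for transformer blocks, every name (`make_layer`, `shared_head`, `forward_with_exits`, `sorted_loss`) is hypothetical, and summing the per-sub-model losses is an assumption made for illustration. It only shows how sub-models that end at different intermediate layers can share one prefix of layers and one output head, so that a single forward pass yields a prediction and a loss for every depth.

```python
import math
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Toy stand-in for a transformer block: elementwise affine map plus tanh."""
    def layer(hidden: Vector) -> Vector:
        return [math.tanh(scale * h + shift) for h in hidden]
    return layer


def shared_head(hidden: Vector) -> Vector:
    """Toy output head shared by all sub-models: softmax over the hidden vector."""
    peak = max(hidden)
    exps = [math.exp(h - peak) for h in hidden]
    total = sum(exps)
    return [e / total for e in exps]


def forward_with_exits(
    layers: Sequence[Layer], exit_depths: Sequence[int], x: Vector
) -> Dict[int, Vector]:
    """Run the stack once and apply the shared head at every requested depth.

    The sub-model of depth d is simply the first d layers plus the head,
    so all sub-models are nested inside the full model.
    """
    invalid = [d for d in exit_depths if not 1 <= d <= len(layers)]
    if invalid:
        raise ValueError(f"exit depths out of range: {invalid}")
    wanted = set(exit_depths)
    outputs: Dict[int, Vector] = {}
    hidden = list(x)
    for depth, layer in enumerate(layers, start=1):
        hidden = layer(hidden)
        if depth in wanted:
            outputs[depth] = shared_head(hidden)
    return outputs


def sorted_loss(
    layers: Sequence[Layer], exit_depths: Sequence[int], x: Vector, target: int
) -> float:
    """Sum the cross-entropy of every nested sub-model (assumed aggregation)."""
    outputs = forward_with_exits(layers, exit_depths, x)
    return sum(-math.log(probs[target]) for probs in outputs.values())


if __name__ == "__main__":
    toy_layers = [make_layer(1.0 + 0.1 * i, 0.05) for i in range(8)]
    example = [0.2, -0.4, 0.9]
    # Per-depth predictions from one pass, then the combined training loss.
    for depth, probs in forward_with_exits(toy_layers, [2, 4, 8], example).items():
        print(depth, [round(p, 3) for p in probs])
    print(sorted_loss(toy_layers, [2, 4, 8], example, target=2))
```

At inference time, a caller with a tight latency budget would read the prediction at a shallow depth, while a caller with more budget would read a deeper one, all from the same set of weights.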
