Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

Abstract

Traditional transformer models often allocate a fixed amount of computation to every input token, which leads to inefficient and unnecessary computation. To address this, Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs, because the entire model must be trained along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while substantially improving computational efficiency, e.g., a 21% speedup with only a 0.2% performance drop. The code is released at https://github.com/CASE-Lab-UMD/Router-Tuning.
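The two ideas in the abstract, a router that decides whether a layer runs and a training setup in which only that router is updated, can be illustrated with a small toy sketch. This is not the authors' implementation (see the repository above for that). The abstract does not specify the gating rule, so the sigmoid router, the 0.5 threshold, the scaling of the layer output by the router score, and all names (`GatedLayer`, `router_weights`, `trainable_parameters`) are assumptions made purely for illustration.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Vector = List[float]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


@dataclass
class GatedLayer:
    """A frozen sublayer wrapped by a small router (hypothetical structure)."""

    sublayer: Callable[[Vector], Vector]  # stands in for a frozen attention block
    router_weights: Vector                # the only parameters meant to be tuned
    threshold: float = 0.5                # assumed skip threshold

    def router_score(self, hidden: Vector) -> float:
        # Linear projection of the hidden state followed by a sigmoid.
        logit = sum(w * h for w, h in zip(self.router_weights, hidden))
        return sigmoid(logit)

    def forward(self, hidden: Vector) -> Vector:
        score = self.router_score(hidden)
        if score < self.threshold:
            # Skip: the residual stream passes through unchanged,
            # so the sublayer's computation is never performed.
            return list(hidden)
        # Execute: add the sublayer output, scaled by the router score.
        update = self.sublayer(hidden)
        return [h + score * u for h, u in zip(hidden, update)]


def trainable_parameters(layers: List[GatedLayer]) -> List[Vector]:
    """Collect only the router weights; the sublayers stay frozen."""
    return [layer.router_weights for layer in layers]
```

In this sketch, an optimizer would be handed only the output of `trainable_parameters`, which mirrors the abstract's description of fine-tuning the router alone while leaving the rest of the model untouched.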
