LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Abstract

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment due to high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The code of LiteASR is available at https://github.com/efeslab/LiteASR.
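
The PCA-based factorization described in the abstract can be illustrated with a minimal NumPy sketch. It is not taken from the LiteASR repository: the function name `low_rank_linear`, the arguments `X_calib` and `k`, and the way the bias is folded into the offset are assumptions made for this example. The idea is to compute a linear layer's outputs on calibration inputs, take their top-k principal directions, and replace the single d_in x d_out matrix with a d_in x k matrix followed by a k x d_out matrix.

```python
import numpy as np


def low_rank_linear(W, b, X_calib, k):
    """Approximate y = x @ W + b by y ~= (x @ A) @ B + c.

    Hypothetical helper for illustration only.
    W: (d_in, d_out) weight, b: (d_out,) bias,
    X_calib: (n, d_in) calibration inputs, k: target rank.
    """
    Y = X_calib @ W + b                 # layer outputs on calibration data
    mean = Y.mean(axis=0)               # PCA operates on centered outputs
    # Rows of Vt are the principal directions of the centered outputs.
    _, _, Vt = np.linalg.svd(Y - mean, full_matrices=False)
    V_k = Vt[:k].T                      # (d_out, k) top-k directions

    # y ~= mean + (y - mean) @ V_k @ V_k.T, expanded with y = x @ W + b:
    A = W @ V_k                         # (d_in, k)  first low-rank factor
    B = V_k.T                           # (k, d_out) second low-rank factor
    c = mean + (b - mean) @ V_k @ V_k.T # constant offset replacing the bias
    return A, B, c


# Synthetic example: inputs that lie in an 8-dimensional subspace,
# so the layer outputs are (after centering) at most rank 8.
rng = np.random.default_rng(0)
d_in, d_out, rank, n = 64, 48, 8, 200
basis = rng.standard_normal((rank, d_in))
X_calib = rng.standard_normal((n, rank)) @ basis
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

A, B, c = low_rank_linear(W, b, X_calib, k=rank)

x_new = rng.standard_normal((5, rank)) @ basis   # unseen inputs, same subspace
y_exact = x_new @ W + b
y_approx = (x_new @ A) @ B + c
max_error = np.abs(y_exact - y_approx).max()

params_before = W.size
params_after = A.size + B.size
```

In this synthetic setting the factorized layer stores d_in*k + k*d_out weights instead of d_in*d_out, and the approximation is essentially exact because the activations are exactly low-rank by construction. On real encoder activations the rank k trades accuracy against cost, and the choice of k per layer is not covered by this sketch.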

