LiteASR: Efficient Automatic Speech Recognition with Low-Rank Approximation

Abstract

Modern automatic speech recognition (ASR) models, such as OpenAI's Whisper, rely on deep encoder-decoder architectures, and their encoders are a critical bottleneck for efficient deployment because of their high computational intensity. We introduce LiteASR, a low-rank compression scheme for ASR encoders that significantly reduces inference costs while maintaining transcription accuracy. Our approach leverages the strong low-rank properties observed in intermediate activations: by applying principal component analysis (PCA) with a small calibration dataset, we approximate linear transformations with a chain of low-rank matrix multiplications, and further optimize self-attention to work in the reduced dimension. Evaluation results show that our method can compress Whisper large-v3's encoder size by over 50%, matching Whisper medium's size with better transcription accuracy, thereby establishing a new Pareto-optimal frontier of efficiency and performance. The LiteASR code is available at https://github.com/efeslab/LiteASR.
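To make the core idea concrete, the following is a minimal NumPy sketch of approximating one linear layer with a chain of two low-rank matrix multiplications, using PCA on output activations gathered from calibration data. It is an illustration under stated assumptions, not the LiteASR implementation: the function name `low_rank_factorize`, the synthetic calibration inputs, the fixed rank `k`, and the mean-centering and bias handling are all choices made here for the example.

```python
import numpy as np

def low_rank_factorize(W, b, X_cal, k):
    """Approximate y = x @ W + b by y ~= (x @ A) @ B + b_new with rank k.

    Hypothetical helper for illustration; not taken from the LiteASR code.
    W: (d_in, d_out), b: (d_out,), X_cal: (n, d_in) calibration inputs.
    """
    # Output activations of the original layer on the calibration set.
    Y = X_cal @ W + b
    mu = Y.mean(axis=0)

    # PCA via SVD of the centered activations; rows of Vt are the
    # principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Y - mu, full_matrices=False)
    V = Vt[:k].T                      # (d_out, k)

    # Y ~= mu + (Y - mu) V V^T, which expands to (X @ A) @ B + b_new.
    A = W @ V                         # (d_in, k)
    B = V.T                           # (k, d_out)
    b_new = mu + (b - mu) @ V @ V.T   # folded-in bias
    return A, B, b_new

# Synthetic data (an assumption): inputs lie in an r-dimensional subspace,
# so the layer's output activations are low-rank by construction.
rng = np.random.default_rng(0)
d_in, d_out, r, n = 64, 48, 8, 256
X_cal = rng.standard_normal((n, r)) @ rng.standard_normal((r, d_in))
W = rng.standard_normal((d_in, d_out))
b = rng.standard_normal(d_out)

A, B, b_new = low_rank_factorize(W, b, X_cal, k=r)

Y_full = X_cal @ W + b
Y_lr = (X_cal @ A) @ B + b_new
rel_err = np.linalg.norm(Y_full - Y_lr) / np.linalg.norm(Y_full)

params_full = W.size
params_lr = A.size + B.size
print(f"relative error: {rel_err:.2e}")
print(f"weights: {params_full} -> {params_lr}")
```

In this sketch the two factors hold `d_in * k + k * d_out` weights instead of `d_in * d_out`, so the saving depends on how small a rank the activations tolerate. Real activations are only approximately low-rank, so the rank would need to be chosen against an accuracy target rather than fixed in advance as it is here.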

