Interface Design for Self-Supervised Speech Models

Abstract

Self-supervised speech (SSL) models have recently been widely adopted for many downstream speech processing tasks. The typical usage pattern is to employ an SSL model as a feature extractor and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and methods for combining them are not well studied. To this end, we extend the general framework for SSL model utilization by proposing an interface that connects the upstream and downstream models. Under this view, the dominant technique of combining features via a layerwise weighted sum can be regarded as one specific interface. We propose several alternative interface designs and demonstrate that the weighted sum interface is suboptimal for many tasks. In particular, we show that a convolutional interface whose depth scales logarithmically with the depth of the upstream model consistently outperforms many other interface designs.
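As a rough illustration of the layerwise weighted sum described above, the sketch below combines per-layer features with softmax-normalized scalar weights. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`softmax`, `weighted_sum_interface`, `log_depth`), the nested-list feature layout, and the use of a softmax over layer scores are all illustrative choices. The `log_depth` helper further assumes that each step of a convolutional interface halves the number of layers, which is only one way the depth could scale logarithmically with the upstream depth.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_sum_interface(layer_features, layer_scores):
    """Combine per-layer features into one sequence via a weighted sum.

    layer_features: list of L layers, each a list of T frames,
                    each frame a list of D floats (hypothetical layout).
    layer_scores:   list of L scalars; learnable in practice, fixed here.
    """
    weights = softmax(layer_scores)
    num_frames = len(layer_features[0])
    dim = len(layer_features[0][0])
    return [
        [
            sum(w * layer[t][d] for w, layer in zip(weights, layer_features))
            for d in range(dim)
        ]
        for t in range(num_frames)
    ]


def log_depth(num_layers):
    """Halving steps needed to reduce num_layers layers to one.

    Assumption: each step merges adjacent layers pairwise, so the
    count is ceil(log2(num_layers)), computed exactly with integers.
    """
    return (num_layers - 1).bit_length()


# Two layers, T=2 frames, D=2 features per frame.
features = [
    [[1.0, 2.0], [3.0, 4.0]],  # layer 0
    [[3.0, 6.0], [5.0, 8.0]],  # layer 1
]
# Equal scores give equal weights, so the result is the plain average.
combined = weighted_sum_interface(features, [0.0, 0.0])
```

With equal scores the weighted sum reduces to a per-frame average across layers. Raising one layer's score shifts the combined features toward that layer, which is the only degree of freedom this interface offers.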
