Interface Design for Self-Supervised Speech Models

June 18, 2024
Authors: Yi-Jen Shih, David Harwath
cs.AI

Abstract

Self-supervised speech (SSL) models have recently become widely adopted for many downstream speech processing tasks. The general usage pattern is to employ SSL models as feature extractors, and then train a downstream prediction head to solve a specific task. However, different layers of SSL models have been shown to capture different types of information, and the methods of combining them are not well studied. To this end, we extend the general framework for SSL model utilization by proposing the interface that connects the upstream and downstream. Under this view, the dominant technique of combining features via a layerwise weighted sum can be regarded as a specific interface. We propose several alternative interface designs and demonstrate that the weighted sum interface is suboptimal for many tasks. In particular, we show that a convolutional interface whose depth scales logarithmically with the depth of the upstream model consistently outperforms many other interface designs.
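To make the notion of an interface concrete, here is a minimal NumPy sketch, not the authors' implementation. It contrasts the layerwise weighted sum with one possible convolutional interface that merges adjacent layers pairwise, so the number of stages grows as ceil(log2(L)) for L upstream layers. The function names, the kernel-size-2 / stride-2 merge, the padding rule for odd layer counts, and the scalar kernels are all assumptions made for illustration; the abstract only states that the convolutional interface's depth scales logarithmically with upstream depth.

```python
import math
import numpy as np


def softmax(x):
    # Numerically stable softmax over a 1-D array.
    e = np.exp(x - np.max(x))
    return e / e.sum()


def weighted_sum_interface(layers, logits):
    """Layerwise weighted sum.

    layers: array of shape (L, T, D), one (T, D) feature map per upstream layer.
    logits: array of shape (L,), the learnable per-layer scalars.
    Returns an array of shape (T, D).
    """
    weights = softmax(logits)                      # (L,), non-negative, sums to 1
    return np.tensordot(weights, layers, axes=1)   # contract over the layer axis


def num_stages(num_layers):
    # Number of halving stages needed to reduce num_layers to a single layer.
    return math.ceil(math.log2(num_layers))


def hierarchical_conv_interface(layers, kernels):
    """Hypothetical log-depth interface (an assumption, not the paper's exact design).

    Each stage applies a kernel-size-2, stride-2 convolution along the layer
    axis, halving the number of layers. kernels is a list of shape-(2,) arrays,
    one per stage.
    """
    x = layers
    for k in kernels:
        if x.shape[0] % 2 == 1:
            # Assumed padding rule: repeat the last layer when the count is odd.
            x = np.concatenate([x, x[-1:]], axis=0)
        x = k[0] * x[0::2] + k[1] * x[1::2]        # (ceil(L/2), T, D)
    if x.shape[0] != 1:
        raise ValueError("not enough stages to reduce to a single layer")
    return x[0]                                    # (T, D)
```

As a usage illustration, with 13 upstream feature maps of shape (T, D), `num_stages(13)` gives 4 stages (13 → 7 → 4 → 2 → 1), whereas the weighted sum always collapses the layer axis in a single step. A real interface would learn the logits and kernels jointly with the downstream head and would typically use richer kernels than the two scalars assumed here.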
