OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer
January 30, 2024
Authors: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
cs.AI
Abstract
Recent studies have advocated for fully open foundation models to promote
transparency and open science. As an initial step, the Open Whisper-style
Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data
and open-source toolkits. With the aim of reproducing Whisper, the previous
OWSM v1 through v3 models were still based on Transformer, which might lead to
inferior performance compared to other state-of-the-art speech encoders. In
this work, we aim to improve the performance and efficiency of OWSM without
extra training data. We present E-Branchformer based OWSM v3.1 models at two
scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer based
speech model that has been made publicly available. It outperforms the previous
OWSM v3 on the vast majority of evaluation benchmarks, while demonstrating up to
25% faster inference speed. We publicly release the data preparation scripts,
pre-trained models and training logs.
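The abstract's key architectural change is swapping the Transformer encoder for E-Branchformer, which processes each frame through two parallel branches: a global self-attention branch and a local convolutional-gating (cgMLP) branch, whose outputs are concatenated and merged back to the model dimension. The sketch below is a schematic illustration of that branch-and-merge idea only, not the authors' implementation: all weights are random placeholders, the attention is single-head, and E-Branchformer's macaron feed-forward modules and the depthwise convolution inside the merge module are omitted for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_branch(x, d):
    # Global branch: toy single-head self-attention with random projections.
    Wq, Wk, Wv = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = softmax(q @ k.T / np.sqrt(d))
    return scores @ v  # (T, d)

def cgmlp_branch(x, d, kernel=3):
    # Local branch (cgMLP-style): up-project, split in half, gate one half
    # with a depthwise 1-D convolution of the other, then down-project.
    W_up = rng.standard_normal((d, 2 * d)) / np.sqrt(d)
    W_down = rng.standard_normal((d, d)) / np.sqrt(d)
    u = x @ W_up
    a, b = u[:, :d], u[:, d:]
    w = rng.standard_normal((kernel, d)) / kernel
    pad = np.pad(b, ((kernel // 2, kernel // 2), (0, 0)))
    conv = sum(pad[i:i + len(b)] * w[i] for i in range(kernel))  # depthwise conv
    return (a * conv) @ W_down  # (T, d)

def ebranchformer_layer(x):
    # Run both branches in parallel, concatenate, and merge back to dim d.
    T, d = x.shape
    merged = np.concatenate([attention_branch(x, d), cgmlp_branch(x, d)], axis=1)
    W_merge = rng.standard_normal((2 * d, d)) / np.sqrt(2 * d)
    return x + merged @ W_merge  # residual connection

x = rng.standard_normal((10, 16))  # 10 frames, model dimension 16
y = ebranchformer_layer(x)
print(y.shape)  # → (10, 16)
```

The intuition behind the paper's result is visible here: the attention branch captures global context while the convolutional branch captures local patterns, and the merge lets the model weight the two, which is what gives E-Branchformer its edge over a pure-Transformer encoder at the same data budget.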