OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

January 30, 2024
作者: Yifan Peng, Jinchuan Tian, William Chen, Siddhant Arora, Brian Yan, Yui Sudo, Muhammad Shakeel, Kwanghee Choi, Jiatong Shi, Xuankai Chang, Jee-weon Jung, Shinji Watanabe
cs.AI

Abstract

Recent studies have advocated for fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. With the aim of reproducing Whisper, the previous OWSM v1 through v3 models were still based on the Transformer, which might lead to inferior performance compared to other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer-based OWSM v3.1 models at two scales, i.e., 100M and 1B. The 1B model is the largest E-Branchformer-based speech model that has been made publicly available. It outperforms the previous OWSM v3 on the vast majority of evaluation benchmarks while demonstrating up to 25% faster inference speed. We publicly release the data preparation scripts, pre-trained models, and training logs.
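
Since the abstract notes that the pre-trained models are released openly, the sketch below illustrates one plausible way to run ASR inference with such a checkpoint through ESPnet. The Hugging Face model ID "espnet/owsm_v3.1_ebf", the Speech2Text keyword arguments, and the result indexing shown here are assumptions not stated on this page; consult the official OWSM release for the definitive identifiers and options.

# Minimal sketch: transcribing audio with a released OWSM v3.1 checkpoint via ESPnet.
# The model ID and argument names below are assumptions; verify against the release.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Load the pre-trained 1B E-Branchformer model (assumed model ID).
speech2text = Speech2Text.from_pretrained(
    "espnet/owsm_v3.1_ebf",
    lang_sym="<eng>",   # target language token
    task_sym="<asr>",   # task token: speech recognition
    beam_size=5,
)

# Transcribe a 16 kHz mono recording.
speech, rate = sf.read("sample_16k.wav")
results = speech2text(speech)
print(results[0][0])  # best hypothesis text

The same interface typically exposes language and task tokens, which is how a single multilingual, multitask model can be steered toward recognition or translation at inference time.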