OWSM v3.1: Better and Faster Open Whisper-Style Speech Models based on E-Branchformer

Abstract

Recent studies have advocated fully open foundation models to promote transparency and open science. As an initial step, the Open Whisper-style Speech Model (OWSM) reproduced OpenAI's Whisper using publicly available data and open-source toolkits. Because they aimed to reproduce Whisper, the previous OWSM v1 through v3 models were still based on the Transformer, which may perform worse than other state-of-the-art speech encoders. In this work, we aim to improve the performance and efficiency of OWSM without extra training data. We present E-Branchformer-based OWSM v3.1 models at two scales, 100M and 1B parameters. The 1B model is the largest E-Branchformer-based speech model that has been made publicly available. It outperforms the previous OWSM v3 on the vast majority of evaluation benchmarks while running inference up to 25% faster. We publicly release the data preparation scripts, pre-trained models, and training logs.
