Pisets: A Robust Speech Recognition System for Lectures and Interviews

Abstract

This work presents "Pisets", a speech-to-text system for scientists and journalists. It is based on a three-component architecture that aims to improve speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly enhanced the system's effectiveness. Advanced uncertainty modeling techniques were also introduced, further improving transcription quality. Compared with WhisperX and the standard Whisper model, the proposed approach yields more robust transcription of long audio across various acoustic conditions. The source code of the Pisets system is publicly available on GitHub: https://github.com/bond005/pisets.
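
The order of the three stages can be illustrated with a minimal sketch. It is not taken from the Pisets repository: the `Segment` type, the function `transcribe_long_audio`, and the three callables are hypothetical names, and the assumption that the first stage returns time-stamped candidate segments is ours, since the abstract does not specify the interfaces between components. Real models (Wav2Vec2, AST, Whisper) would be plugged in behind the callables.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Assumption: audio is a sequence of mono samples; the real system may differ.
Audio = Sequence[float]


@dataclass
class Segment:
    start: float  # segment start, in seconds
    end: float    # segment end, in seconds
    text: str     # text attached to the segment at the current stage


def transcribe_long_audio(
    audio: Audio,
    primary_recognize: Callable[[Audio], List[Segment]],
    is_speech: Callable[[Audio, Segment], bool],
    final_recognize: Callable[[Audio, Segment], str],
) -> List[Segment]:
    """Run the three stages in order and return the surviving segments."""
    results: List[Segment] = []
    # Stage 1 (Wav2Vec2 in the paper): propose candidate segments.
    for seg in primary_recognize(audio):
        # Stage 2 (AST in the paper): drop false positives, i.e. non-speech.
        if not is_speech(audio, seg):
            continue
        # Stage 3 (Whisper in the paper): produce the final text.
        text = final_recognize(audio, seg).strip()
        if text:  # skip segments for which the final stage returns nothing
            results.append(Segment(seg.start, seg.end, text))
    return results
```

Filtering before the final stage means the last recognizer only sees segments that the classifier accepted as speech, which is one plausible reading of how the architecture limits hallucinations on non-speech audio.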

