Pisets: A Robust Speech Recognition System for Lectures and Interviews

Abstract

This work presents "Pisets", a speech-to-text system for scientists and journalists. It is based on a three-component architecture aimed at improving speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition using Wav2Vec2, false-positive filtering via the Audio Spectrogram Transformer (AST), and final speech recognition through Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly improved the system's effectiveness. In addition, advanced uncertainty modeling techniques were introduced, which further improved transcription quality. Compared to WhisperX and the standard Whisper model, the proposed approaches provide robust transcription of long audio recordings across various acoustic conditions. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
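The three-component flow described in the abstract can be illustrated with a minimal sketch. It is not the Pisets implementation: the `Segment` type, the `transcribe_long_audio` function, and the three callables are hypothetical names introduced here, and the assumption that the primary recognizer returns time-stamped candidate segments is ours, not something the abstract states. In a real system the callables would wrap Wav2Vec2, AST, and Whisper models respectively.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Stand-in for a real waveform type (mono samples); an assumption for this sketch.
Audio = Sequence[float]


@dataclass
class Segment:
    """A hypothetical time-stamped piece of the recording."""
    start: float  # seconds
    end: float    # seconds
    text: str = ""


def transcribe_long_audio(
    audio: Audio,
    primary_recognizer: Callable[[Audio], List[Segment]],
    is_speech: Callable[[Audio, Segment], bool],
    final_recognizer: Callable[[Audio, Segment], str],
) -> List[Segment]:
    """Chain the three stages: primary recognition, filtering, final recognition."""
    # Stage 1 (Wav2Vec2 in the paper): propose candidate segments.
    candidates = primary_recognizer(audio)
    # Stage 2 (AST in the paper): drop false positives, i.e. non-speech segments.
    speech_only = [seg for seg in candidates if is_speech(audio, seg)]
    # Stage 3 (Whisper in the paper): produce the final text for each kept segment.
    return [
        Segment(seg.start, seg.end, final_recognizer(audio, seg))
        for seg in speech_only
    ]
```

Passing the stages as callables keeps the sketch independent of any particular model library; the filtering step runs before the final recognizer so that segments judged to be non-speech never reach it, which is where hallucinated text would otherwise be produced.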
