

Pisets: A Robust Speech Recognition System for Lectures and Interviews

January 26, 2026
Authors: Ivan Bondarenko, Daniil Grebenkin, Oleg Sedukhin, Mikhail Klementev, Roman Derunets, Lyudmila Budneva
cs.AI

Abstract

This work presents "Pisets", a speech-to-text system for scientists and journalists based on a three-component architecture that aims to improve speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition with Wav2Vec2, false-positive filtering with an Audio Spectrogram Transformer (AST), and final speech recognition with Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly improved the system's effectiveness, and the introduction of advanced uncertainty modeling techniques further raised transcription quality. Compared with WhisperX and the standard Whisper model, the proposed approach transcribes long audio recordings robustly across varied acoustic conditions. The source code of the "Pisets" system is publicly available on GitHub: https://github.com/bond005/pisets.
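The three-stage flow described above can be sketched as a simple pipeline. This is a minimal illustration, not the authors' implementation: the stage functions below (`primary_recognizer`, `speech_filter`, `final_recognizer`) are hypothetical stand-ins for the real Wav2Vec2, AST, and Whisper models, and the `Segment` type is invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Segment:
    start: float            # segment start time, seconds (hypothetical type)
    end: float              # segment end time, seconds
    is_speech: bool = True  # set by the primary recognizer / filter stages
    text: str = ""          # filled in by the final recognizer

def transcribe(segments: List[Segment],
               primary_recognizer: Callable[[Segment], Segment],
               speech_filter: Callable[[Segment], bool],
               final_recognizer: Callable[[Segment], Segment]) -> List[Segment]:
    """Three-stage pipeline, per the abstract:
    1. a primary recognizer (Wav2Vec2 in the paper) proposes candidate
       speech segments;
    2. a classifier (AST in the paper) drops false positives such as
       music, noise, or silence, which helps avoid Whisper hallucinations;
    3. the strongest ASR model (Whisper) transcribes what survives."""
    candidates = [primary_recognizer(s) for s in segments]
    speech_only = [s for s in candidates if speech_filter(s)]
    return [final_recognizer(s) for s in speech_only]

# Toy demonstration with stub stages: the middle segment is marked
# non-speech by the "primary recognizer" and filtered out.
def toy_primary(s: Segment) -> Segment:
    s.is_speech = (s.start != 2.0)
    return s

def toy_final(s: Segment) -> Segment:
    return Segment(s.start, s.end, True, "stub transcription")

result = transcribe(
    [Segment(0.0, 2.0), Segment(2.0, 4.0), Segment(4.0, 6.0)],
    toy_primary,
    lambda s: s.is_speech,
    toy_final,
)
```

The design point the sketch captures is that filtering happens *before* the expensive final recognizer runs, so non-speech audio never reaches Whisper.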