Pisets: A Robust Speech Recognition System for Lectures and Interviews

Abstract

This work presents Pisets, a speech-to-text system for scientists and journalists. It is built on a three-component architecture designed to improve speech recognition accuracy while minimizing the errors and hallucinations associated with the Whisper model. The architecture comprises primary recognition with Wav2Vec2, false-positive filtering with the Audio Spectrogram Transformer (AST), and final speech recognition with Whisper. Curriculum learning and the use of diverse Russian-language speech corpora significantly improved the system's effectiveness. In addition, advanced uncertainty modeling techniques were introduced, which further improved transcription quality. Compared with WhisperX and the standard Whisper model, the proposed approaches provide more robust transcription of long audio recordings across a range of acoustic conditions. The source code of Pisets is publicly available on GitHub: https://github.com/bond005/pisets.
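
The three-component flow described in the abstract can be illustrated with a minimal Python sketch. It is not taken from the Pisets repository: the `Segment` type, the function `transcribe_three_stage`, and the three toy stand-ins are hypothetical names introduced here, and the real Wav2Vec2, AST, and Whisper models are replaced by plain callables so that the example runs without any model weights. It only shows the order of the stages: candidate segments from a primary recognizer, a filter that discards false positives, and a final recognizer applied to the surviving segments.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Audio = Sequence[float]  # mono samples; a stand-in for a real waveform array


@dataclass
class Segment:
    """A candidate speech region (hypothetical structure, for illustration)."""
    start: float      # seconds
    end: float        # seconds
    draft_text: str   # rough transcript produced by the first stage


def transcribe_three_stage(
    audio: Audio,
    primary_recognizer: Callable[[Audio], List[Segment]],
    is_speech: Callable[[Audio, Segment], bool],
    final_recognizer: Callable[[Audio, Segment], str],
) -> List[Tuple[float, float, str]]:
    """Run the three stages in order and return (start, end, text) triples."""
    # Stage 1: primary recognition proposes candidate segments
    # (the role the abstract assigns to Wav2Vec2).
    candidates = primary_recognizer(audio)
    # Stage 2: drop false positives (the role assigned to AST).
    confirmed = [seg for seg in candidates if is_speech(audio, seg)]
    # Stage 3: final recognition on the remaining segments
    # (the role assigned to Whisper).
    return [(seg.start, seg.end, final_recognizer(audio, seg)) for seg in confirmed]


# Toy stand-ins so the sketch runs end to end; they are NOT real models.
def toy_primary(audio: Audio) -> List[Segment]:
    return [Segment(0.0, 2.0, "hello wrld"), Segment(2.0, 3.0, "")]


def toy_is_speech(audio: Audio, seg: Segment) -> bool:
    # Assumption for the demo only: an empty draft marks a false positive.
    return bool(seg.draft_text.strip())


def toy_final(audio: Audio, seg: Segment) -> str:
    # Assumption for the demo only: the final stage corrects the draft.
    return seg.draft_text.replace("wrld", "world")


result = transcribe_three_stage([0.0] * 16000, toy_primary, toy_is_speech, toy_final)
print(result)  # [(0.0, 2.0, 'hello world')]
```

Passing the stages in as callables keeps the sketch independent of any particular model library; in a real system each callable would wrap the corresponding model, and the filter would classify the audio itself rather than inspect the draft text.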
