Pseudo2Real: Task Arithmetic for Pseudo-Label Correction in Automatic Speech Recognition

Abstract

Robust ASR under domain shift is crucial because real-world systems encounter unseen accents and domains with limited labeled data. Although pseudo-labeling offers a practical workaround, it often introduces systematic, accent-specific errors that filtering fails to fix. We ask: How can we correct these recurring biases without target ground truth? We propose a simple parameter-space correction: in a source domain containing both real and pseudo-labeled data, two ASR models are fine-tuned from the same initialization, one on ground-truth labels and the other on pseudo-labels, and their weight difference forms a correction vector that captures pseudo-label biases. When applied to a pseudo-labeled target model, this vector enhances recognition, achieving up to a 35% relative Word Error Rate (WER) reduction on AfriSpeech-200 across ten African accents with the Whisper tiny model.
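
The parameter-space correction described above can be illustrated with a minimal sketch. It is not the authors' implementation: model weights are represented as plain dictionaries of float lists instead of real Whisper checkpoints, the function names `correction_vector` and `apply_correction` are hypothetical, the sign convention (ground-truth weights minus pseudo-label weights) is an assumption, and the `scale` coefficient is an assumed knob that the abstract does not mention.

```python
from typing import Dict, List

# Toy stand-in for a model checkpoint: parameter name -> flat list of weights.
Weights = Dict[str, List[float]]


def correction_vector(source_real: Weights, source_pseudo: Weights) -> Weights:
    """Difference between two source-domain models fine-tuned from the same init.

    Assumed convention: (ground-truth model) - (pseudo-label model), so that
    adding the result moves a pseudo-label model toward ground-truth behavior.
    """
    return {
        name: [r - p for r, p in zip(source_real[name], source_pseudo[name])]
        for name in source_real
    }


def apply_correction(target_pseudo: Weights, delta: Weights, scale: float = 1.0) -> Weights:
    """Add the (optionally scaled) correction vector to a pseudo-labeled target model.

    `scale` is a hypothetical hyperparameter; scale=1.0 adds the vector unchanged.
    """
    return {
        name: [w + scale * d for w, d in zip(target_pseudo[name], delta[name])]
        for name in target_pseudo
    }


# Toy numbers chosen only to make the arithmetic easy to follow.
source_real = {"layer.weight": [1.0, 2.0]}
source_pseudo = {"layer.weight": [0.5, 2.5]}
target_pseudo = {"layer.weight": [3.0, 4.0]}

delta = correction_vector(source_real, source_pseudo)
corrected = apply_correction(target_pseudo, delta)
```

With real checkpoints the same arithmetic would be applied tensor by tensor over matching parameter names, which requires all three models to share one architecture and one initialization, as the abstract specifies.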