BUT System for the MLC-SLM Challenge

Abstract

We present a two-speaker automatic speech recognition (ASR) system that combines DiCoW -- a diarization-conditioned variant of Whisper -- with DiariZen, a diarization pipeline built on top of Pyannote. We first evaluate both systems in out-of-domain (OOD) multilingual scenarios without any fine-tuning. In this scenario, DiariZen consistently outperforms the baseline Pyannote diarization model, demonstrating strong generalization. Despite being fine-tuned on English-only data for target-speaker ASR, DiCoW retains solid multilingual performance, indicating that encoder modifications preserve Whisper's multilingual capabilities. We then fine-tune both DiCoW and DiariZen on the MLC-SLM challenge data. The fine-tuned DiariZen continues to outperform the fine-tuned Pyannote baseline, while DiCoW sees further gains from domain adaptation. Our final system achieves a micro-average tcpWER/CER of 16.75% and ranks second in Task 2 of the MLC-SLM challenge. Lastly, we identify several labeling inconsistencies in the training data -- such as missing speech segments and incorrect silence annotations -- which can hinder diarization fine-tuning. We propose simple mitigation strategies to address these issues and improve system robustness.
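The reported 16.75% is a micro-average, which is generally understood as pooling error counts over all evaluation subsets before dividing, so larger subsets weigh more than they would in a plain mean of per-subset rates. The sketch below only illustrates that pooling step under stated assumptions: the `EvalSubset` type, the function names, and the numbers are hypothetical, and it does not reproduce the challenge's scoring tool or the time-constrained speaker matching behind tcpWER.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class EvalSubset:
    """Hypothetical container for per-subset scoring counts."""
    name: str
    errors: int      # substitutions + deletions + insertions
    ref_tokens: int  # reference words (WER) or characters (CER)


def micro_average(subsets: Sequence[EvalSubset]) -> float:
    """Pool errors and reference tokens over all subsets, then divide."""
    total_errors = sum(s.errors for s in subsets)
    total_ref = sum(s.ref_tokens for s in subsets)
    if total_ref == 0:
        raise ValueError("no reference tokens to score against")
    return total_errors / total_ref


def macro_average(subsets: Sequence[EvalSubset]) -> float:
    """Average the per-subset error rates, weighting each subset equally."""
    if not subsets:
        raise ValueError("no subsets given")
    rates = [s.errors / s.ref_tokens for s in subsets]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Illustrative numbers only; these are not results from the paper.
    demo = [
        EvalSubset("subset_a", errors=10, ref_tokens=100),
        EvalSubset("subset_b", errors=90, ref_tokens=300),
    ]
    print(f"micro: {micro_average(demo):.4f}")
    print(f"macro: {macro_average(demo):.4f}")
```

With subsets of unequal size the two averages differ, which is why it matters that the headline figure is stated as a micro-average.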
