End-to-End Joint ASR and Speaker Role Diarization with Child-Adult Interactions
January 25, 2026
Authors: Anfeng Xu, Tiantian Feng, Somer Bishop, Catherine Lord, Shrikanth Narayanan
cs.AI
Abstract
Accurate transcription and speaker diarization of child-adult spoken interactions are crucial for developmental and clinical research. However, manual annotation is time-consuming and difficult to scale. Existing automated systems typically rely on cascaded speaker diarization and speech recognition pipelines, which can propagate errors from one stage to the next. This paper presents a unified end-to-end framework that extends the Whisper encoder-decoder architecture to jointly model automatic speech recognition (ASR) and child-adult speaker role diarization. The proposed approach integrates four components: (i) a serialized output training scheme that emits speaker tags and start/end timestamps, (ii) a lightweight frame-level diarization head that enhances speaker-discriminative encoder representations, (iii) diarization-guided silence suppression for improved temporal precision, and (iv) a state-machine-based forced decoding procedure that guarantees structurally valid outputs. Evaluations on two datasets show consistent and substantial improvements over two cascaded baselines: the framework achieves lower multi-talker word error rates and competitive diarization accuracy with both Whisper-small and Whisper-large models. These findings highlight the effectiveness and practical utility of the proposed joint modeling framework for generating reliable, speaker-attributed transcripts of child-adult interactions at scale. The code and model weights are publicly available.
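To make component (iv) more concrete, the following is a minimal sketch of how a state machine can constrain greedy decoding so that every segment follows the pattern speaker tag, start timestamp, one or more words, end timestamp. It is an illustration under stated assumptions, not the authors' implementation: the token strings (`<child>`, `<adult>`, `<t:...>`, `<eos>`), the state names, and the function `constrained_greedy_decode` are all hypothetical, and the per-step score dictionaries stand in for decoder logits.

```python
from enum import Enum, auto

# Hypothetical token inventory; the abstract does not specify the real vocabulary.
SPEAKERS = {"<child>", "<adult>"}
EOS = "<eos>"


class State(Enum):
    SPEAKER = auto()     # expect a speaker-role tag, or end of sequence
    START = auto()       # expect the segment's start timestamp
    FIRST_WORD = auto()  # expect at least one word
    WORDS = auto()       # expect more words, or the end timestamp


def kind(token):
    """Classify a token as 'speaker', 'eos', 'time', or 'word'."""
    if token in SPEAKERS:
        return "speaker"
    if token == EOS:
        return "eos"
    if token.startswith("<t:") and token.endswith(">"):
        return "time"
    return "word"


# For each state: which token kinds are allowed, and the state that follows.
# A next state of None means decoding has finished.
TRANSITIONS = {
    State.SPEAKER: {"speaker": State.START, "eos": None},
    State.START: {"time": State.FIRST_WORD},
    State.FIRST_WORD: {"word": State.WORDS},
    State.WORDS: {"word": State.WORDS, "time": State.SPEAKER},
}


def constrained_greedy_decode(step_scores):
    """Pick, at each step, the highest-scoring token the state machine allows.

    step_scores is a list of {token: score} dicts, one per decoding step.
    """
    state = State.SPEAKER
    output = []
    for scores in step_scores:
        allowed = TRANSITIONS[state]
        candidates = {t: s for t, s in scores.items() if kind(t) in allowed}
        if not candidates:
            raise ValueError(f"no allowed token in state {state.name}")
        token = max(candidates, key=candidates.get)
        output.append(token)
        state = allowed[kind(token)]
        if state is None:  # <eos> was emitted
            break
    return output


# Toy scores: unconstrained greedy decoding would start with "hello",
# which is not a valid way to open a segment.
EXAMPLE_STEPS = [
    {"hello": 0.9, "<child>": 0.1, "<adult>": 0.05},
    {"<t:0.00>": 0.4, "hi": 0.6},
    {"<t:0.50>": 0.7, "hi": 0.3},
    {"<t:0.50>": 0.7, "there": 0.2},
    {"hi": 0.8, "<eos>": 0.2},
]

if __name__ == "__main__":
    print(constrained_greedy_decode(EXAMPLE_STEPS))
```

In this toy run the constraint overrides the top-scoring token at four of the five steps. The sketch checks only the ordering of token kinds; it does not enforce that timestamps increase, and it does not force a final `<eos>` if the score list runs out first. With a real Whisper model, the same masking would typically be applied to the logits before token selection.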