MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization

January 4, 2026
Authors: MOSI.AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI

Abstract

Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to determine precisely when each speaker talks, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and the inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that jointly performs Speaker-Attributed, Time-Stamped Transcription in an end-to-end paradigm. Trained on extensive real-world, in-the-wild data and equipped with a 128k context window that supports inputs of up to 90 minutes, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.