MOSS Transcribe Diarize: Accurate Transcription with Speaker Diarization
January 4, 2026
Authors: MOSI.AI, Donghua Yu, Zhengyuan Lin, Chen Yang, Yiyang Zhang, Hanfu Chen, Jingqi Chen, Ke Chen, Liwei Fan, Yi Jiang, Jie Zhu, Muchen Li, Wenxuan Wang, Yang Wang, Zhe Xu, Yitian Gong, Yuqian Zhang, Wenbo Zhang, Zhaoye Fei, Qinyuan Cheng, Shimin Li, Xipeng Qiu
cs.AI
Abstract
Speaker-Attributed, Time-Stamped Transcription (SATS) aims to transcribe what is said and to determine precisely when each speaker is talking, which is particularly valuable for meeting transcription. Existing SATS systems rarely adopt an end-to-end formulation and are further constrained by limited context windows, weak long-range speaker memory, and an inability to output timestamps. To address these limitations, we present MOSS Transcribe Diarize, a unified multimodal large language model that performs SATS jointly in an end-to-end paradigm. Trained on extensive real-world, in-the-wild data and equipped with a 128k context window that accommodates inputs of up to 90 minutes, MOSS Transcribe Diarize scales well and generalizes robustly. Across comprehensive evaluations, it outperforms state-of-the-art commercial systems on multiple public and in-house benchmarks.
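To make the task concrete, the following is a minimal sketch of what a speaker-attributed, time-stamped transcript could look like as a data structure. The `Segment` schema, the speaker labels such as "S1", the helper names, and the `[MM:SS.mmm - MM:SS.mmm] speaker: text` rendering are all assumptions chosen for illustration; the abstract does not specify the model's actual output format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Segment:
    """One speaker-attributed, time-stamped unit (hypothetical schema)."""
    speaker: str   # speaker label, e.g. "S1" (illustrative naming)
    start: float   # start time in seconds
    end: float     # end time in seconds
    text: str      # transcribed words for this span


def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm; minutes may exceed 59 for long audio."""
    total_ms = round(seconds * 1000)
    minutes, rem_ms = divmod(total_ms, 60_000)
    secs, ms = divmod(rem_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{ms:03d}"


def render(segments: list[Segment]) -> str:
    """Render segments in chronological order, one line per segment."""
    ordered = sorted(segments, key=lambda s: s.start)
    return "\n".join(
        f"[{format_timestamp(s.start)} - {format_timestamp(s.end)}] "
        f"{s.speaker}: {s.text}"
        for s in ordered
    )


def speaking_time(segments: list[Segment]) -> dict[str, float]:
    """Total seconds attributed to each speaker."""
    totals: dict[str, float] = {}
    for s in segments:
        totals[s.speaker] = totals.get(s.speaker, 0.0) + (s.end - s.start)
    return totals


# Invented example data, not taken from the paper.
meeting = [
    Segment("S2", 2.5, 4.0, "Sure, go ahead."),
    Segment("S1", 0.0, 2.5, "Shall we start with the agenda?"),
    Segment("S1", 4.0, 5.5, "First item is the budget."),
]
print(render(meeting))
print(speaking_time(meeting))
```

A flat list of segments keeps "who", "when", and "what" together in one record, which is the joint output that SATS targets, in contrast to pipelines that produce transcription and diarization separately and align them afterwards.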