VIBEVOICE-ASR Technical Report

January 26, 2026
Authors: Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei
cs.AI

Abstract

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the challenges of context fragmentation and multi-speaker complexity that persist in long-form audio (e.g., meetings, podcasts) despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio, unifying Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
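To make the "unified end-to-end generation task" concrete: a model that jointly emits transcription, speaker labels, and timestamps must serialize all three into one output stream. The sketch below shows one plausible serialization and a parser for it. The segment format (`[start-end] <speaker> text`) and the function names are illustrative assumptions, not the report's actual output schema.

```python
import re

# Hypothetical unified output: one line per segment, carrying the
# timestamp span, speaker label, and transcript together. This format
# is an assumption for illustration, not VibeVoice-ASR's real schema.
SEGMENT_RE = re.compile(
    r"\[(?P<start>\d+(?:\.\d+)?)-(?P<end>\d+(?:\.\d+)?)\]"
    r"\s+<(?P<speaker>\w+)>\s+(?P<text>.+)"
)

def parse_segments(generated: str) -> list[dict]:
    """Split a jointly generated ASR+diarization+timestamp transcript
    into structured segments."""
    segments = []
    for line in generated.strip().splitlines():
        m = SEGMENT_RE.match(line.strip())
        if m:
            segments.append({
                "start": float(m.group("start")),
                "end": float(m.group("end")),
                "speaker": m.group("speaker"),
                "text": m.group("text"),
            })
    return segments

example_output = """
[0.0-4.2] <spk1> Welcome to the meeting.
[4.2-9.8] <spk2> Thanks, let's review the results.
"""
print(parse_segments(example_output))
```

Because all three tasks share one decoded sequence, speaker turns and timestamps stay consistent with the recognized words by construction, rather than being aligned after the fact by a separate diarization stage.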