VIBEVOICE-ASR Technical Report
January 26, 2026
Authors: Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, Yujie Tu, Chenyu Yang, Wenhui Wang, Songchen Xu, Yutao Sun, Hangbo Bao, Weijiang Xu, Yi Zhu, Zehua Wang, Ting Song, Yan Xia, Zewen Chi, Shaohan Huang, Liang Wang, Chuang Ding, Shuai Wang, Xie Chen, Furu Wei
cs.AI
Abstract
This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice, designed to address the persistent challenges of context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts) that remain despite recent advancements in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.
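To make the context-injection idea concrete, the sketch below assembles a transcription prompt from user-supplied terminology. This is a hypothetical illustration only: the function name, prompt wording, and parameters are assumptions, since the abstract does not specify VibeVoice-ASR's actual prompt format.

```python
def build_asr_prompt(context_terms, language_hint=None):
    """Assemble a context-injection prompt for long-form transcription.

    context_terms: domain-specific words or names the user wants the
    recognizer to prefer (e.g., product names, polyphonic characters).
    language_hint: optional, since the system needs no explicit
    language setting; included here only for illustration.
    """
    parts = ["Transcribe the audio with speaker labels and timestamps."]
    if context_terms:
        # Injecting custom terms biases recognition toward them.
        parts.append("Relevant terminology: " + ", ".join(context_terms))
    if language_hint:
        parts.append(f"Primary language: {language_hint}")
    return "\n".join(parts)

prompt = build_asr_prompt(["VibeVoice", "diarization"])
print(prompt)
```

The same pattern extends naturally to other context, such as meeting agendas or participant names, which the abstract suggests helps disambiguate domain terms.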