VIBEVOICE-ASR Technical Report

Abstract

This report presents VibeVoice-ASR, a general-purpose speech understanding framework built upon VibeVoice. It is designed to address context fragmentation and multi-speaker complexity in long-form audio (e.g., meetings, podcasts), challenges that persist despite recent advances in short-form speech recognition. Unlike traditional pipelined approaches that rely on audio chunking, VibeVoice-ASR supports single-pass processing of up to 60 minutes of audio. It unifies Automatic Speech Recognition, Speaker Diarization, and Timestamping into a single end-to-end generation task. In addition, VibeVoice-ASR supports over 50 languages, requires no explicit language setting, and natively handles code-switching within and across utterances. Furthermore, we introduce a prompt-based context injection mechanism that allows users to supply customized context, significantly improving accuracy on domain-specific terminology and polyphonic character disambiguation.

