Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

Abstract

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that takes a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, shorter than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation in which users simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.