Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

May 5, 2025
Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu
cs.AI

Abstract

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, faster than the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation in which users simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.

