Voila: Voice-Language Foundation Models for Real-Time Autonomous Interaction and Voice Role-Play

May 5, 2025
Authors: Yemin Shi, Yu Shu, Siwei Dong, Guangyi Liu, Jaward Sesay, Jingwen Li, Zhiting Hu
cs.AI

Abstract

A voice AI agent that blends seamlessly into daily life would interact with humans in an autonomous, real-time, and emotionally expressive manner. Rather than merely reacting to commands, it would continuously listen, reason, and respond proactively, fostering fluid, dynamic, and emotionally resonant interactions. We introduce Voila, a family of large voice-language foundation models that take a step toward this vision. Voila moves beyond traditional pipeline systems by adopting a new end-to-end architecture that enables full-duplex, low-latency conversations while preserving rich vocal nuances such as tone, rhythm, and emotion. It achieves a response latency of just 195 milliseconds, surpassing the average human response time. Its hierarchical multi-scale Transformer integrates the reasoning capabilities of large language models (LLMs) with powerful acoustic modeling, enabling natural, persona-aware voice generation: users can simply write text instructions to define the speaker's identity, tone, and other characteristics. Moreover, Voila supports over one million pre-built voices and efficient customization of new ones from audio samples as short as 10 seconds. Beyond spoken dialogue, Voila is designed as a unified model for a wide range of voice-based applications, including automatic speech recognition (ASR), text-to-speech (TTS), and, with minimal adaptation, multilingual speech translation. Voila is fully open-sourced to support open research and accelerate progress toward next-generation human-machine interactions.
