FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning

January 16, 2026
Authors: Tanyu Chen, Tairan Chen, Kai Shen, Zhenghua Bao, Zhihui Zhang, Man Yuan, Yi Shi
cs.AI

Abstract

Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B .
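
The abstract names two quantities that are easy to illustrate: the 1:2 interleaved text-audio token schedule and the Real-Time Factor. The following is a minimal sketch, not the authors' implementation. The abstract states only the 1:2 ratio, so the ordering inside each group (text first), the handling of a stream that runs out early, and the names `interleave` and `real_time_factor` are assumptions made for illustration. RTF uses the usual definition of processing time divided by the duration of the generated audio, so values below 1 mean generation runs faster than playback.

```python
from itertools import islice
from typing import Iterable, Iterator, Tuple


def interleave(
    text_tokens: Iterable[str],
    audio_tokens: Iterable[str],
    n_text: int = 1,
    n_audio: int = 2,
) -> Iterator[Tuple[str, str]]:
    """Yield (kind, token) pairs in repeating groups of n_text text tokens
    followed by n_audio audio tokens.

    Assumption: text precedes audio within a group, and once one stream is
    exhausted the other simply continues. The source does not specify this.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    while True:
        text_chunk = list(islice(text_it, n_text))
        audio_chunk = list(islice(audio_it, n_audio))
        if not text_chunk and not audio_chunk:
            return  # both streams are exhausted
        for tok in text_chunk:
            yield ("text", tok)
        for tok in audio_chunk:
            yield ("audio", tok)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time divided by the duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # Hypothetical token labels, used only to make the schedule visible.
    schedule = list(interleave(["t0", "t1"], ["a0", "a1", "a2", "a3"]))
    print(schedule)
    # Hypothetical timings: 4.3 s of compute for 10 s of audio is an RTF near 0.43.
    print(real_time_factor(4.3, 10.0))
```

Because audio tokens are emitted alongside the text rather than after the full text response, a streaming decoder can begin playback after the first group, which is the property the abstract credits for sub-second latency.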