FlashLabs Chroma 1.0: A Real-Time End-to-End Spoken Dialogue Model with Personalized Voice Cloning
January 16, 2026
Authors: Tanyu Chen, Tairan Chen, Kai Shen, Zhenghua Bao, Zhihui Zhang, Man Yuan, Yi Shi
cs.AI
Abstract
Recent end-to-end spoken dialogue systems leverage speech tokenizers and neural audio codecs to enable LLMs to operate directly on discrete speech representations. However, these models often exhibit limited speaker identity preservation, hindering personalized voice interaction. In this work, we present Chroma 1.0, the first open-source, real-time, end-to-end spoken dialogue model that achieves both low-latency interaction and high-fidelity personalized voice cloning. Chroma achieves sub-second end-to-end latency through an interleaved text-audio token schedule (1:2) that supports streaming generation, while maintaining high-quality personalized voice synthesis across multi-turn conversations. Our experimental results demonstrate that Chroma achieves a 10.96% relative improvement in speaker similarity over the human baseline, with a Real-Time Factor (RTF) of 0.43, while maintaining strong reasoning and dialogue capabilities. Our code and models are publicly available at https://github.com/FlashLabs-AI-Corp/FlashLabs-Chroma and https://huggingface.co/FlashLabs/Chroma-4B.
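To make the 1:2 interleaved text-audio token schedule and the reported Real-Time Factor concrete, here is a minimal sketch assuming a simple generator-based scheduler; the function names, token values, and timings below are illustrative assumptions and do not reflect the released Chroma implementation.

```python
# Minimal sketch (not the authors' implementation): interleaving text and audio
# tokens in a 1:2 ratio for streaming generation, plus a Real-Time Factor check.
# The helpers interleave_1_to_2 and real_time_factor are illustrative names.

from typing import Iterable, Iterator


def interleave_1_to_2(text_tokens: Iterable[int], audio_tokens: Iterable[int]) -> Iterator[int]:
    """Yield tokens in a repeating [text, audio, audio] pattern.

    This mirrors the 1:2 text-audio schedule described in the abstract: for every
    text token emitted, two audio tokens follow, so audio can be decoded and
    played back while the next text token is still being generated.
    """
    text_it, audio_it = iter(text_tokens), iter(audio_tokens)
    while True:
        try:
            yield next(text_it)   # one text token
            yield next(audio_it)  # followed by two audio tokens
            yield next(audio_it)
        except StopIteration:
            return


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = generation time / duration of audio produced (< 1 means faster than real time)."""
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    text = [101, 102, 103]
    audio = [7, 8, 9, 10, 11, 12]
    print(list(interleave_1_to_2(text, audio)))  # [101, 7, 8, 102, 9, 10, 103, 11, 12]

    # Example: producing 10 s of speech in 4.3 s of compute gives the reported RTF of 0.43.
    print(real_time_factor(4.3, 10.0))           # 0.43
```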