Covo-Audio Technical Report
February 10, 2026
Authors: Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou, Hanzhao Li, Mingyu Cui, Hao Zhang, Kun Wei, Le Xu, Zikang Huang, Jiajun Xu, Jiliang Hu, Xiang He, Zeyu Xie, Jiawen Kang, Youjun Chen, Meng Yu, Dong Yu, Rilin Chen, Linlin Di, Shulin Feng, Na Hu, Yang Liu, Bang Wang, Shan Yang
cs.AI
Abstract
In this work, we present Covo-Audio, a 7B-parameter end-to-end large audio-language model (LALM) that directly processes continuous audio inputs and generates audio outputs within a single unified architecture. Through large-scale curated pretraining and targeted post-training, Covo-Audio achieves state-of-the-art or competitive performance among models of comparable scale across a broad spectrum of tasks, including speech-text modeling, spoken dialogue, speech understanding, audio understanding, and full-duplex voice interaction. Extensive evaluations demonstrate that the pretrained foundation model exhibits strong speech-text comprehension and semantic reasoning capabilities on multiple benchmarks, outperforming representative open-source models of comparable scale. Furthermore, Covo-Audio-Chat, the dialogue-oriented variant, demonstrates strong spoken conversational abilities, including contextual understanding and reasoning, instruction following, and the generation of contextually appropriate, empathetic responses, validating its applicability to real-world conversational assistant scenarios. Covo-Audio-Chat-FD, the evolved full-duplex model, achieves substantially improved performance in both spoken dialogue capabilities and full-duplex interaction behaviors, demonstrating strong robustness in practical settings. To mitigate the high cost of deploying end-to-end LALMs in natural conversational systems, we propose an intelligence-speaker decoupling strategy that separates dialogue intelligence from voice rendering, enabling flexible voice customization with minimal text-to-speech (TTS) data while preserving dialogue performance. Overall, our results highlight the strong potential of 7B-scale models to integrate sophisticated audio intelligence with high-level semantic reasoning, and suggest a scalable path toward more capable and versatile LALMs.
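To make the intelligence-speaker decoupling strategy concrete, the sketch below outlines one plausible shape for such a pipeline. It is a minimal illustration, not the report's actual interface: every name here (`DialogueLALM`, `VoiceRenderer`, `IntermediateRepr`, and their methods) is a hypothetical placeholder we introduce for exposition. The point it demonstrates is only the separation the abstract describes: the dialogue model emits a speaker-agnostic intermediate representation, and a lightweight voice renderer, adaptable on a small TTS corpus, is the sole component that changes when customizing the voice.

```python
# Minimal sketch of intelligence-speaker decoupling, under assumptions.
# All classes and methods are hypothetical stand-ins, not Covo-Audio's API.
from dataclasses import dataclass
from typing import Sequence

@dataclass
class IntermediateRepr:
    """Speaker-agnostic output of the dialogue model (e.g., semantic tokens)."""
    tokens: Sequence[int]

class DialogueLALM:
    """Stand-in for the end-to-end audio-language model ("intelligence")."""
    def respond(self, audio_in: bytes) -> IntermediateRepr:
        # In the real system this would run the 7B model over continuous
        # audio input; here we return a placeholder representation.
        return IntermediateRepr(tokens=[0])

class VoiceRenderer:
    """Lightweight TTS-style decoder ("speaker"); the only part adapted per voice."""
    def __init__(self, speaker_id: str):
        self.speaker_id = speaker_id

    def render(self, rep: IntermediateRepr) -> bytes:
        # Map the speaker-agnostic representation to a waveform in the
        # target voice; placeholder output here.
        return b""

    def adapt(self, tts_pairs: list[tuple[str, bytes]]) -> None:
        # Fine-tune on a small (text, audio) corpus for a new voice.
        # The dialogue model is untouched, so dialogue quality is preserved.
        ...

def converse(lalm: DialogueLALM, renderer: VoiceRenderer, audio_in: bytes) -> bytes:
    rep = lalm.respond(audio_in)   # dialogue intelligence
    return renderer.render(rep)    # decoupled voice rendering
```

Under this reading, the cost benefit follows from the split itself: customizing a voice retrains only the small renderer on minimal TTS data, rather than post-training the full 7B model per speaker.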