FireRedASR2S: A State-of-the-Art Industrial-Grade All-in-One Automatic Speech Recognition System

March 11, 2026
Authors: Kaituo Xu, Yan Jia, Kai Huang, Junjie Chen, Wenpeng Li, Kun Liu, Feng-Long Xie, Xu Tang, Yao Hu
cs.AI

Abstract

We present FireRedASR2S, a state-of-the-art industrial-grade all-in-one automatic speech recognition (ASR) system. It integrates four modules in a unified pipeline: ASR, Voice Activity Detection (VAD), Spoken Language Identification (LID), and Punctuation Prediction (Punc). All four modules achieve state-of-the-art performance on the evaluated benchmarks.

FireRedASR2: An ASR module with two variants, FireRedASR2-LLM (8B+ parameters) and FireRedASR2-AED (1B+ parameters), supporting speech and singing transcription for Mandarin, Chinese dialects and accents, English, and code-switching. Compared to FireRedASR, FireRedASR2 delivers improved recognition accuracy and broader dialect and accent coverage. FireRedASR2-LLM achieves an average character error rate (CER) of 2.89% on 4 public Mandarin benchmarks and 11.55% on 19 public Chinese dialect and accent benchmarks, outperforming competitive baselines including Doubao-ASR, Qwen3-ASR, and Fun-ASR.

FireRedVAD: An ultra-lightweight module (0.6M parameters) based on the Deep Feedforward Sequential Memory Network (DFSMN), supporting streaming VAD, non-streaming VAD, and multi-label VAD (mVAD). On the FLEURS-VAD-102 benchmark, it achieves 97.57% frame-level F1 and 99.60% AUC-ROC, outperforming Silero-VAD, TEN-VAD, FunASR-VAD, and WebRTC-VAD.

FireRedLID: An encoder-decoder LID module supporting 100+ languages and 20+ Chinese dialects and accents. On FLEURS (82 languages), it achieves 97.18% utterance-level accuracy, outperforming Whisper and SpeechBrain.

FireRedPunc: A BERT-style punctuation prediction module for Chinese and English. On multi-domain benchmarks, it achieves 78.90% average F1, outperforming FunASR-Punc (62.77%).

To advance research in speech processing, we release model weights and code at https://github.com/FireRedTeam/FireRedASR2S.
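For readers unfamiliar with the CER metric quoted for the ASR module, the following is a minimal sketch of how a character error rate is conventionally computed: the character-level edit distance between reference and hypothesis, divided by the reference length. The function names `edit_distance` and `cer` are illustrative, and the whitespace stripping is an assumption; the abstract does not specify the text normalization or the averaging scheme used for the reported figures.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance over reference length.

    Assumption: spaces are removed before scoring; real evaluations usually
    apply further normalization (punctuation, casing, numerals).
    """
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Under this definition, a hypothesis with one substituted character against a six-character reference scores 1/6. An "average CER" over several benchmarks could be a simple mean of per-benchmark values, although that is also an assumption here.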