StepAudio 2.5 Technical Report

May 22, 2026
Authors: Bin Lin, Bo Zhao, Boyong Wu, Chao Yan, Chen Wu, Cheng Yi, Chengyuan Yao, Daijiao Liu, Fei Tian, Feng Tian, Haiyang Sun, Haoyang Zhang, Jiangjie Zhen, Jinglan Gong, Jun Chen, Li Xie, Peilin Li, Peng Yang, Pengfei Tan, Qingjian Lin, Runze Li, Shenghua Hu, Siyi Zhou, Wenwen Qu, Xiangyu Li, Xiangyu Tony Zhang, Xuerui Yang, Yang Yang, Yechang Huang, Yu Fu, Yuchu Luo, Yuxin Li, Yuxin Zhang, Zhengyan Sheng, Brian Li, Chang Zeng, Changlin Zhang, Chen Geng, Chenghao Dong, Chengli Feng, Dan Zhou, Danni Wan, Di Chen, Die Zhang, Dongqing Pang, Guanglong Yang, Guoqiang Hu, Huangxi Zhu, Jianzheng Gao, Jinghua Liang, Jinmei Wan, Junjie Yuan, Kang An, Lei Lei, Limin Zhong, Lun Cai, Mengqiang Ren, Min Xu, Mingliang Li, Mingxiao Li, Na Wang, Qiang Tong, Qiaoling Huang, Qingfu Du, Rui Wang, Shengchen Zhou, Shi Qiu, Shihao Peng, Shiliang Yang, Siqi Tu, Tianjiao Deng, Ting Xu, Tong Wang, WeiMing Niu, Wuxun Xie, Xianwei Zhang, Xianyu Feng, Xiaojia Liu, Xing Chen, Xiongbin Wu, Yan Wu, Yang Li, Yi Liu, Yifan Zhang, Yile Liu, Yongshen Long, Yu Luo, Yuanhao Ding, Yuhao Wang, Yuhe Yin, Yunfang Xu, Yuxiang Yang, Zhiguo Huang, Zhiyue Wu, Zichao Li, Zichao Zhou, Daxin Jiang, Future Li, Gang Yu, Xiangyu Zhang, Yibo Zhu
cs.AI

Abstract

Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundation models often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a single audio-language foundation model can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
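
The abstract does not say how the ASR branch's multi-token decoding is verified. As an illustration only, the sketch below shows one common reading of the idea: a generic greedy draft-and-verify loop in the style of speculative decoding, in which a cheap drafter proposes several tokens and the main model keeps the longest prefix it agrees with. This is an assumption, not the authors' method, and every name here (`draft_and_verify_decode`, `toy_target`, `toy_draft`) is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

Token = int
NextTokenFn = Callable[[Sequence[Token]], Token]


def greedy_decode(target: NextTokenFn, prefix: List[Token], n: int) -> List[Token]:
    """Baseline: one target call per generated token."""
    out = list(prefix)
    for _ in range(n):
        out.append(target(out))
    return out


def draft_and_verify_decode(
    target: NextTokenFn, draft: NextTokenFn, prefix: List[Token], n: int, k: int = 4
) -> Tuple[List[Token], int]:
    """Generate n tokens; returns (tokens, number of draft/verify rounds)."""
    out = list(prefix)
    goal = len(prefix) + n
    rounds = 0
    while len(out) < goal:
        rounds += 1
        # 1) The drafter proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        ctx = list(out)
        for _ in range(min(k, goal - len(out))):
            tok = draft(ctx)
            proposal.append(tok)
            ctx.append(tok)
        # 2) The target checks each proposed position. A real system would
        #    score all positions in one batched forward pass; this toy
        #    version calls the target once per position for clarity.
        for tok in proposal:
            expected = target(out)
            out.append(expected)  # always keep the target's own choice
            if expected != tok:
                break  # first disagreement: discard the rest of the draft
    return out, rounds


# Hypothetical toy models: the drafter agrees with the target except
# when the last token is 5.
def toy_target(ctx: Sequence[Token]) -> Token:
    return (ctx[-1] * 3 + 1) % 7


def toy_draft(ctx: Sequence[Token]) -> Token:
    return 0 if ctx[-1] == 5 else toy_target(ctx)


tokens, rounds = draft_and_verify_decode(toy_target, toy_draft, [1], n=12, k=4)
```

Because every emitted token is the target's own greedy choice, the output matches plain greedy decoding exactly; the potential gain is that several tokens can be committed per verification round when the drafter is usually right.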