Step-Audio-R1.5 Technical Report
April 28, 2026
Authors: Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, Yechang Huang, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Gang Yu, Xiangyu Zhang, Daxin Jiang
cs.AI
Abstract
Recent advances in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm, driven by the success of text-based reasoning models, relies overwhelmingly on Reinforcement Learning with Verifiable Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering genuine audio intelligence, or merely reducing a continuous sensory medium to a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but also profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
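To make the "verifiable reward trap" concrete, the toy sketch below contrasts the two reward signals the abstract describes. It is a minimal illustration, not the report's implementation: all names (`AudioResponse`, `rlvr_reward`, `rlhf_reward`) and the fixed preference weights are hypothetical stand-ins, and a real RLHF pipeline would use a reward model trained on human preference data rather than a hand-weighted sum.

```python
# Illustrative sketch only: contrasts an RLVR-style exact-match reward with
# an RLHF-style scalar reward. All classes, functions, and weights here are
# hypothetical; the report does not specify its reward interfaces.

from dataclasses import dataclass


@dataclass
class AudioResponse:
    transcript: str       # text content of the spoken reply
    prosody_score: float  # stand-in for acoustic naturalness, in [0, 1]
    empathy_score: float  # stand-in for emotional continuity, in [0, 1]


def rlvr_reward(response: AudioResponse, gold_label: str) -> float:
    """RLVR-style reward: 1.0 iff the response matches a verifiable text
    label. Everything outside the label (prosody, empathy) is invisible to
    the optimizer -- the 'verifiable reward trap'."""
    return 1.0 if response.transcript.strip() == gold_label.strip() else 0.0


def rlhf_reward(response: AudioResponse,
                weights: tuple[float, float, float] = (0.4, 0.3, 0.3)) -> float:
    """RLHF-style reward: a scalar reflecting human preference. A fixed
    weighted sum stands in for a trained reward model, so acoustic nuance
    now contributes to the training signal."""
    w_text, w_prosody, w_empathy = weights
    correctness = 1.0  # in practice, also judged by the reward model
    return (w_text * correctness
            + w_prosody * response.prosody_score
            + w_empathy * response.empathy_score)


if __name__ == "__main__":
    flat = AudioResponse("The answer is 42.", prosody_score=0.1, empathy_score=0.1)
    warm = AudioResponse("The answer is 42.", prosody_score=0.9, empathy_score=0.9)

    # Under RLVR both replies earn identical reward (1.0, 1.0): the flat,
    # robotic delivery is never penalized. Under the RLHF-style signal the
    # warm delivery scores higher (0.94 vs. 0.46).
    print(rlvr_reward(flat, "The answer is 42."), rlvr_reward(warm, "The answer is 42."))
    print(round(rlhf_reward(flat), 2), round(rlhf_reward(warm), 2))
```

The point of the sketch is the gradient, not the numbers: because the RLVR signal is constant across any two responses with the same transcript, a policy optimized against it has no incentive to preserve prosody or emotional continuity, which is the degradation the abstract attributes to RLVR-trained audio models.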