
Step-Audio-R1.5 Technical Report

April 28, 2026
Authors: Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, Yechang Huang, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Gang Yu, Xiangyu Zhang, Daxin Jiang
cs.AI

Abstract

Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verifiable Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
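The RLVR-versus-RLHF contrast at the heart of the abstract can be sketched in a few lines. The following is an illustrative toy example, not the paper's implementation: all function names, feature names, and weights are hypothetical. It shows how a verifiable reward collapses a response to a binary label match, while a preference-style score (as a reward model in RLHF would produce) can also credit graded conversational qualities such as prosody and emotional continuity.

```python
def rlvr_reward(predicted_label: str, gold_label: str) -> float:
    """Verifiable reward: 1.0 iff the text label matches exactly, else 0.0.
    Everything about the response other than the label is ignored."""
    return 1.0 if predicted_label.strip() == gold_label.strip() else 0.0


def preference_score(response_features: dict, weights: dict) -> float:
    """Stand-in for a learned reward model's scalar score: a weighted sum
    over graded qualities (hypothetical feature names) instead of a single
    verifiable label."""
    return sum(w * response_features.get(name, 0.0) for name, w in weights.items())


# A terse but correct answer maximizes the verifiable reward...
print(rlvr_reward("cat", "cat"))  # -> 1.0

# ...while a preference score can distinguish two equally "correct" answers
# by their conversational qualities.
weights = {"correctness": 0.5, "prosody": 0.3, "continuity": 0.2}
robotic = {"correctness": 1.0, "prosody": 0.1, "continuity": 0.1}
natural = {"correctness": 1.0, "prosody": 0.9, "continuity": 0.8}
print(preference_score(robotic, weights))  # -> 0.55
print(preference_score(natural, weights))  # -> 0.93
```

Under the binary reward both responses above are indistinguishable (both score 1.0), which is exactly the "answering machine" failure mode the abstract describes; the preference score separates them.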