Step-Audio-R1.5 Technical Report

Abstract

Recent advances in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm, driven by the success of text-based reasoning models, relies overwhelmingly on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium to a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but also profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.
