StepAudio 2.5 Technical Report

Abstract

Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundation models often fall short of specialized systems in automatic speech recognition (ASR), text-to-speech synthesis (TTS), and real-time spoken interaction, and closing this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we start from the premise that once text and audio share a multimodal representation space, task specialization becomes a matter of operational regime: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism for defining complex optimization targets. We combine this RLHF-centric alignment with specialized decoding to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch improves transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime tasks, demonstrating that a single audio-language foundation model can internalize the distinct deployment objectives of speech understanding, generation, and live interaction.