StepAudio 2.5 Technical Report

Abstract

Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundation models often struggle to match the depth of specialized systems in automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we work from the premise that once text and audio share a multimodal representation space, task specialization becomes a matter of operational regime: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism for defining complex optimization targets. We combine this RLHF-centric alignment with specialized decoding to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch improves transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime tasks, demonstrating that a single audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.
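The abstract does not spell out how "verifiable multi-token decoding" works, so the following is only an illustrative sketch of one common reading: a cheap drafter proposes several tokens at once, and the full model verifies them, keeping the longest prefix it agrees with. The names `draft_fn`, `verify_fn`, and `decode_with_verification` are hypothetical and do not come from the report; the actual StepAudio 2.5 mechanism may differ.

```python
from typing import Callable, List, Sequence

# Hypothetical interfaces (assumptions, not from the report):
#   draft_fn(prefix, k) -> up to k proposed next tokens from a cheap multi-token head
#   verify_fn(prefix)   -> the single greedy next token from the full model
DraftFn = Callable[[Sequence[int], int], List[int]]
VerifyFn = Callable[[Sequence[int]], int]


def decode_with_verification(
    prefix: Sequence[int],
    draft_fn: DraftFn,
    verify_fn: VerifyFn,
    k: int = 4,
    max_new: int = 16,
    eos: int = 0,
) -> List[int]:
    """Draft k tokens at a time and keep only what the verifier agrees with.

    The output always equals plain greedy decoding with verify_fn; the draft
    can only change how much work is needed, never the result. A real system
    would score the whole draft in one batched forward pass; here the
    verifier is called token by token for clarity.
    """
    out = list(prefix)
    limit = len(out) + max_new
    while len(out) < limit:
        proposal = draft_fn(out, k)
        progressed = False
        for tok in proposal:
            expected = verify_fn(out)
            out.append(expected)  # always emit the verifier's token
            progressed = True
            if expected == eos or len(out) >= limit:
                return out
            if tok != expected:
                break  # the rest of the draft is invalid; re-draft
        if not progressed:
            # Empty draft: fall back to one ordinary greedy step.
            expected = verify_fn(out)
            out.append(expected)
            if expected == eos:
                return out
    return out
```

Under these assumptions, efficiency comes from how many drafted tokens are accepted per verification pass, while transcription quality is pinned to the verifier.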