
Step-Audio-R1 Technical Report

November 19, 2025
Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
cs.AI

Abstract

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question - can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
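The abstract names the Modality-Grounded Reasoning Distillation (MGRD) framework but gives no implementation details. As a purely illustrative sketch, the core idea of keeping only reasoning chains that "genuinely ground themselves in acoustic features" could be pictured as a rejection filter over teacher-generated chains. The grounding criterion below (a keyword check against acoustic terms) is a hypothetical stand-in of our own, not the paper's method.

```python
# Hypothetical sketch of modality-grounded filtering for distillation data.
# MGRD's actual grounding criterion is not described in the abstract; the
# keyword heuristic here is an assumption for illustration only.

ACOUSTIC_CUES = {"pitch", "timbre", "tempo", "formant", "reverb",
                 "prosody", "spectral", "harmonic", "onset", "loudness"}

def is_grounded(chain: str, min_cues: int = 1) -> bool:
    """Treat a reasoning chain as 'audio-grounded' if it mentions at
    least `min_cues` acoustic terms (hypothetical criterion)."""
    words = {w.strip(".,!?").lower() for w in chain.split()}
    return len(words & ACOUSTIC_CUES) >= min_cues

def build_distillation_set(samples):
    """Keep (audio_id, reasoning_chain, answer) triples whose chain
    passes the grounding filter; discard ungrounded deliberations."""
    return [s for s in samples if is_grounded(s[1])]

candidates = [
    ("clip_01", "The rising pitch and fast tempo suggest excitement.", "excited"),
    ("clip_02", "The speaker is probably happy because people usually are.", "happy"),
]
kept = build_distillation_set(candidates)  # only clip_01 survives the filter
```

In this toy setup, the second chain is rejected because it reasons from a prior about people rather than from anything audible, which mirrors the "hallucinated, disconnected deliberations" the abstract says MGRD is designed to avoid.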