Step-Audio-R1 Technical Report

November 19, 2025
Authors: Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu
cs.AI

Abstract

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question: can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.
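The abstract does not specify how MGRD is implemented. As a rough illustration only, a filter-then-distill loop of the shape sketched below would match the description: sample reasoning chains from a teacher conditioned on the audio, keep only chains that are grounded in acoustic evidence, and fine-tune the student on the survivors. Every name here (`Teacher`, `Student`, `Judge`, `generate_cot`, `sft_step`, and the grounding criteria) is an assumption for illustration, not the report's actual API or method.

```python
# A minimal sketch of what a Modality-Grounded Reasoning Distillation
# (MGRD) loop could look like. All interfaces below are hypothetical.

from dataclasses import dataclass
from typing import List, Protocol


@dataclass
class Sample:
    audio: bytes    # raw audio clip (speech, environmental sound, or music)
    question: str   # audio-understanding prompt
    answer: str     # reference answer


class Teacher(Protocol):
    def generate_cot(self, audio: bytes, question: str) -> str: ...


class Judge(Protocol):
    def cites_acoustic_evidence(self, chain: str, audio: bytes) -> bool: ...
    def final_answer(self, chain: str) -> str: ...


class Student(Protocol):
    def sft_step(self, audio: bytes, prompt: str, target: str) -> None: ...


def is_grounded(chain: str, sample: Sample, judge: Judge) -> bool:
    """Keep a chain only if it references acoustic evidence AND reaches
    the reference answer, filtering out 'hallucinated' text-only
    deliberations. The exact acceptance criteria are assumptions."""
    return (judge.cites_acoustic_evidence(chain, sample.audio)
            and judge.final_answer(chain) == sample.answer)


def mgrd_distill(teacher: Teacher, student: Student, judge: Judge,
                 data: List[Sample], k: int = 8) -> None:
    # 1) Sample k candidate reasoning chains per example from the
    #    teacher, conditioned on the audio, and keep the grounded ones.
    corpus = [(s, chain)
              for s in data
              for chain in (teacher.generate_cot(s.audio, s.question)
                            for _ in range(k))
              if is_grounded(chain, s, judge)]

    # 2) Distill: supervised fine-tuning of the student on the filtered
    #    chains, so extended deliberation stays anchored to the audio.
    for s, chain in corpus:
        student.sft_step(audio=s.audio, prompt=s.question, target=chain)
```

The point of the filter step in this sketch is the abstract's central claim: distilling only chains that are anchored in acoustic features is what turns long deliberation from a liability into an asset.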