Kimi-Audio Technical Report

Abstract

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices involved in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5 Hz audio tokenizer, design a novel LLM-based architecture that takes continuous features as input and produces discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset of more than 13 million hours of audio covering a wide range of modalities, including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks, including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
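To give a feel for what a 12.5 Hz tokenizer rate implies, the short sketch below works through the arithmetic. It is only a back-of-the-envelope illustration: the function names are hypothetical, and it assumes one discrete token per frame, which the abstract does not state.

```python
# Back-of-the-envelope arithmetic for a 12.5 Hz audio tokenizer.
# Assumption (not stated in the abstract): one discrete token per frame.
# All names here are hypothetical and are not part of the Kimi-Audio codebase.

TOKEN_RATE_HZ = 12.5  # token rate quoted in the abstract


def token_duration_ms(rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Length of audio, in milliseconds, covered by a single token."""
    return 1000.0 / rate_hz


def tokens_for_duration(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Number of whole tokens needed to cover `seconds` of audio."""
    return int(seconds * rate_hz)


if __name__ == "__main__":
    print(token_duration_ms())                      # milliseconds per token
    print(tokens_for_duration(3600))                # tokens per hour of audio
    print(tokens_for_duration(13_000_000 * 3600))   # tokens for 13 million hours
```

Under these assumptions each token spans 80 ms of audio, one hour of audio maps to 45,000 tokens, and 13 million hours corresponds to roughly 585 billion tokens. Since the abstract says "more than 13 million hours", the last figure is best read as a lower bound.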