Kimi-Audio Technical Report

Abstract

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices used in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5 Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset of more than 13 million hours of audio data covering a wide range of modalities, including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks, including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
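
To make the 12.5 Hz token rate and the idea of chunk-wise processing concrete, the following is a minimal Python sketch, not the Kimi-Audio implementation. Only the 12.5 Hz rate comes from the abstract. The helper names `tokens_for_duration` and `split_into_chunks` and the chunk size are hypothetical, chosen for illustration, and the flow-matching detokenizer itself is not modeled here.

```python
# Token rate stated in the abstract: 12.5 audio tokens per second.
TOKEN_RATE_HZ = 12.5

# Each token therefore spans 1000 / 12.5 = 80 ms of audio.
MS_PER_TOKEN = 1000 / TOKEN_RATE_HZ


def tokens_for_duration(seconds: float) -> int:
    """Return the number of whole tokens covering `seconds` of audio."""
    return int(seconds * TOKEN_RATE_HZ)


def split_into_chunks(tokens, chunk_size):
    """Yield consecutive fixed-size chunks of a token sequence.

    Hypothetical helper: a streaming detokenizer could consume one chunk
    at a time instead of waiting for the full sequence. The last chunk
    may be shorter than `chunk_size`.
    """
    for start in range(0, len(tokens), chunk_size):
        yield tokens[start:start + chunk_size]


if __name__ == "__main__":
    n = tokens_for_duration(10)  # tokens in 10 seconds of audio
    # chunk_size=50 is an assumed value, not taken from the report.
    chunks = list(split_into_chunks(list(range(n)), chunk_size=50))
    print(MS_PER_TOKEN, n, [len(c) for c in chunks])
```

Under these assumptions, a low token rate keeps sequences short: ten seconds of audio maps to 125 tokens, which the sketch splits into chunks of 50, 50, and 25.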
