Kimi-Audio Technical Report

Abstract

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5 Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset of more than 13 million hours of audio covering a wide range of modalities, including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continually pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse range of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks, including speech recognition, audio understanding, audio question answering, and speech conversation. We release the code, model checkpoints, and evaluation toolkits at https://github.com/MoonshotAI/Kimi-Audio.
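
To make the 12.5 Hz tokenizer rate concrete, the following is a minimal arithmetic sketch, not code from the Kimi-Audio repository. It assumes one discrete token per 1/12.5 s of audio and rounds partial frames up; the function names and the rounding choice are hypothetical and chosen only for illustration.

```python
import math

# Token rate stated in the abstract; everything else here is an illustrative assumption.
AUDIO_TOKEN_RATE_HZ = 12.5


def ms_per_token(rate_hz: float = AUDIO_TOKEN_RATE_HZ) -> float:
    """Duration of audio covered by a single token, in milliseconds."""
    return 1000.0 / rate_hz


def tokens_for_duration(seconds: float, rate_hz: float = AUDIO_TOKEN_RATE_HZ) -> int:
    """Number of tokens needed to cover `seconds` of audio.

    Assumption: a trailing partial frame still consumes one token (round up).
    """
    if seconds < 0:
        raise ValueError("seconds must be non-negative")
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    print(ms_per_token())            # milliseconds of audio per token
    print(tokens_for_duration(60))   # tokens for one minute of audio
```

Under these assumptions, each token spans 80 ms of audio, so a one-minute clip maps to 750 tokens.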
