ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark for Understanding and Characterizing Childhood Sounds

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks drawn from 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised models, ASR-oriented models, and large audio-language models, on tasks that include physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performing models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language level and tracking how speech production changes with age.