ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
May 28, 2026
Authors: Tiantian Feng, Anfeng Xu, Xuan Shi, Aditya Kommineni, Shakhrul Iman Siam, Megan Micheletti, Zhonghao Shi, Helen Tager-Flusberg, Mi Zhang, Lynn K. Perry, Catherine Lord, Daniel Messinger, Shrikanth Narayanan
cs.AI
Abstract
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.