ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark for Understanding and Characterizing Childhood Sounds

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised models, ASR-oriented models, and large audio-language models, on tasks spanning physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking how speech production changes with age.