ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark for Understanding and Characterizing Sounds Across Childhood

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks spanning physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.