ChildVox: Benchmarking Speech, Audio, and Large Audio-Language Models for Sound Understanding and Characterization Across Early Childhood

Abstract

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. It integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks such as physiological sound classification, vocalization and canonical syllable modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox yields a suite of high-performing models for recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.