ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood

Hugging Face Daily Papers 05/28/26, 12:00 AM Papers

speech audio benchmark childhood developmental large-audio-language-model self-supervised

Summary

ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages, integrating over 20 sub-tasks from 17 child-centered audio and speech datasets.

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.

Original Article

Cached at: 05/29/26, 07:00 AM

Source: https://huggingface.co/papers/2605.29257 Authors:

Abstract

ChildVox presents a comprehensive benchmark for analyzing children’s acoustic communication across developmental stages using diverse audio and speech models.

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children’s language levels and tracking speech production with age.
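The abstract does not say how scores are aggregated across sub-tasks, so the following is only a minimal sketch of one common way a multi-task benchmark could be summarized for cross-domain comparison: average sub-task scores within each task family, then average the family means per model. The model names, family labels, sub-task names, and scores are all hypothetical placeholders, not results or identifiers from the paper.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (model, task family, sub-task, score).
# Every value here is an illustrative placeholder, not a reported result.
RESULTS = [
    ("model_a", "physiological", "subtask_1", 0.75),
    ("model_a", "physiological", "subtask_2", 0.25),
    ("model_a", "vocalization", "subtask_3", 0.50),
    ("model_b", "physiological", "subtask_1", 0.50),
    ("model_b", "vocalization", "subtask_3", 1.00),
]


def macro_average(results):
    """Average scores within each task family, then across families, per model."""
    scores = defaultdict(lambda: defaultdict(list))
    for model, family, _subtask, score in results:
        scores[model][family].append(score)

    summary = {}
    for model, families in scores.items():
        per_family = {family: mean(vals) for family, vals in families.items()}
        summary[model] = {
            "per_family": per_family,
            # Each family counts equally, regardless of how many sub-tasks it has.
            "macro": mean(per_family.values()),
        }
    return summary


if __name__ == "__main__":
    for model, stats in macro_average(RESULTS).items():
        print(model, stats["per_family"], round(stats["macro"], 3))
```

Averaging per family first keeps a family with many sub-tasks from dominating the overall number; a benchmark could just as reasonably report per-task scores only, and nothing here should be read as the authors' protocol.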
