ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood


Summary

ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages, integrating over 20 sub-tasks from 17 child-centered audio and speech datasets.

We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children's language levels and tracking speech production with age.
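The cross-corpus comparison described above can be pictured as averaging each model family's scores per task group across datasets. The following is a minimal sketch under stated assumptions, not the authors' evaluation code: the record layout, the names `RESULTS` and `summarize`, the corpus labels, and every score are hypothetical placeholders, and the abstract does not say which metrics the benchmark reports.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (model_family, dataset, task_group, score in [0, 1]).
# All names and numbers are placeholders, not results from the paper.
RESULTS = [
    ("self_supervised", "corpus_a", "physiological_sound", 0.80),
    ("self_supervised", "corpus_b", "physiological_sound", 0.70),
    ("self_supervised", "corpus_c", "speech_recognition", 0.60),
    ("asr_oriented", "corpus_a", "physiological_sound", 0.50),
    ("asr_oriented", "corpus_c", "speech_recognition", 0.90),
]


def summarize(results):
    """Average scores per (model_family, task_group) across datasets."""
    buckets = defaultdict(list)
    for model, _dataset, task, score in results:
        buckets[(model, task)].append(score)
    return {key: mean(scores) for key, scores in buckets.items()}


if __name__ == "__main__":
    # Print one line per (model family, task group) pair, sorted for stability.
    for (model, task), avg in sorted(summarize(RESULTS).items()):
        print(f"{model:16s} {task:20s} {avg:.3f}")
```

Grouping by task group rather than by dataset keeps the comparison readable when several corpora cover the same kind of signal. A real evaluation would also need a metric suited to each task, which this sketch does not model.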

Source: https://huggingface.co/papers/2605.29257


