ChildVox: A Speech, Audio, and Large Audio-Language Model Benchmark in Understanding and Characterizing Sound across Childhood
Summary
ChildVox presents a comprehensive benchmark for analyzing children's acoustic communication across developmental stages, integrating over 20 sub-tasks from 17 child-centered audio and speech datasets.
Source: https://huggingface.co/papers/2605.29257
Abstract
ChildVox presents a comprehensive benchmark for analyzing children’s acoustic communication across developmental stages using diverse audio and speech models.
We present ChildVox, a novel benchmark for characterizing the diverse acoustic signals through which children communicate. Specifically, ChildVox follows the full developmental trajectory from birth through school age, covering physiological sounds, non-linguistic vocalizations, canonical syllables, and spoken language. ChildVox integrates more than 20 sub-tasks across 17 child-centered audio and speech datasets, enabling systematic cross-corpus and cross-domain comparison. We evaluate a representative range of audio and speech foundation models, including self-supervised, ASR-oriented, and large audio-language models, on tasks including physiological sound classification, vocalization and canonical syllables modeling, and speech quality assessment and recognition. Benchmark results show that ChildVox provides a suite of high-performance models in recognizing a wide range of acoustic signals from children, supporting downstream applications such as characterizing children’s language levels and tracking speech production with age.
Similar Articles
OpenBMB/VoxCPM
OpenBMB releases VoxCPM2, a 2B-parameter tokenizer-free TTS model trained on 2M+ hours of multilingual speech data, supporting 30 languages, voice design, controllable cloning, and 48kHz output.
openbmb/VoxCPM2
VoxCPM2 is an open-source, tokenizer-free diffusion autoregressive Text-to-Speech model supporting 30 languages with 2B parameters, 48kHz audio output, and features including voice design from natural language descriptions, controllable voice cloning, and real-time streaming capabilities.
Tested out VoxCPM2 (Open-Source TTS) locally. The "Ultimate Cloning" mode capturing breathing/accents is getting insane.
Technical breakdown and benchmarks of VoxCPM2, an open-source TTS model featuring Ultimate Cloning Mode for capturing breathing and accents, tested locally with low VRAM footprint and cross-lingual accent retention.
Audio-Visual Intelligence in Large Foundation Models
This survey paper provides a comprehensive review of audio-visual intelligence within large foundation models, establishing a unified taxonomy, synthesizing core methodologies, and outlining key datasets, benchmarks, and open research challenges.
KoALa-Bench: Evaluating Large Audio Language Models on Korean Speech Understanding and Faithfulness
KoALa-Bench introduces a Korean-focused benchmark suite for evaluating large audio language models on six tasks, including novel measures of speech faithfulness and Korea-specific cultural content.