I can't believe text normalization is so underdiscussed in streaming text-to-speech [D]

Reddit r/MachineLearning Tools

Summary

Author highlights under-discussed text normalization issues in streaming TTS and shares a vendor benchmark evaluating 1000+ sentences across 31 categories for dates, URLs, acronyms, etc.

Kinda suprises me how little discussion there is around about mistakes in streaming TTS models People look for natural readers, high voice quality, expressive speech. And most models don't look dumb here and fail. They fail when you give them basic stuff like price, dates, URLs, promo codes, phone numbers. So I was looking for some info and found a benchmark that compares commercial real time streaming TTS models in terms of how they pronounce dates, URLs, acronyms, etc. They are checking 1000+ sentences in 31 categories then use Gemini to see how results came out. [https://async-vocie-ai-text-to-speech-normalization-benchmark.static.hf.space/index.html](https://async-vocie-ai-text-to-speech-normalization-benchmark.static.hf.space/index.html) . Looks valid to me. Obviously this is a vendor benchmark so I am not taking it for granted but the focus feels on point. This has been one of the biggest challenges for us in the production.I am curious how you guys deal with it in practice.
Original Article

Similar Articles

dots.tts Technical Report

Hugging Face Daily Papers

dots.tts presents a 2B-parameter continuous autoregressive TTS model trained on multilingual data, achieving state-of-the-art performance on benchmarks like Seed-TTS-Eval with low-latency streaming via CFG-aware MeanFlow distillation. The model, code, and checkpoints are released under Apache 2.0.

BlasBench: An Open Benchmark for Irish Speech Recognition

arXiv cs.CL

BlasBench introduces an open evaluation benchmark for Irish speech recognition with Irish-aware text normalization that preserves linguistic features like fadas, lenition, and eclipsis. The paper benchmarks 12 ASR systems across four architecture families, revealing significant generalization gaps and showing that existing multilingual systems struggle with Irish due to inadequate normalization.