speech-processing

A Survey of Automated Presentation Coaching: Systems, Methods, and Open Challenges

arXiv cs.CL · 2d ago

This survey reviews existing automated presentation coaching systems, introduces a five-dimensional task taxonomy covering pronunciation, stress, prosody, pacing, and content faithfulness, and identifies open challenges such as annotation scarcity, accent fairness, and low-latency feedback.

@jreuben1: Speech and Language Processing (3rd ed. draft) Dan Jurafsky and James H. Martin https://web.stanford.edu/~jurafsky/slp3…

X AI KOLs Following · 2026-06-19

The January 6, 2026 draft of the 3rd edition of 'Speech and Language Processing' by Dan Jurafsky and James H. Martin has been released. It features a revised structure with a focus on large language models, plus updated chapters.

hubert.cpp, a C++ implementation of distilHuBERT [P]

Reddit r/MachineLearning · 2026-06-12

hubert.cpp is a C++ implementation of distilHuBERT with no runtime dependencies, compiled-in weights, and dynamic sizing. It performs on par with ONNX Runtime and is designed for easy integration into CMake projects.

The Dynamics of Human and AI-Generated Language: How Semantics Fluctuates across Different Timescales

arXiv cs.CL · 2026-06-11

This paper introduces a semantic-timescale analysis pipeline to study how generic vs. specific content is distributed over time in human and AI-generated speech, revealing that autocorrelation-window measures capture temporal organization of semantics beyond static lexical distributions.

Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming

arXiv cs.CL · 2026-06-10

A novel method for multilingual word-level forced alignment combines self-supervised representations from MMS and a phoneme boundary detector with a learned dynamic programming decoder, outperforming existing aligners on English and unseen languages without further training.

InfoShield: Privacy-Preserving Speech Representations for Mental Health Screening via Information-Theoretic Optimization

arXiv cs.CL · 2026-06-05

InfoShield introduces a privacy-preserving method for speech representations in mental health screening using information-theoretic optimization, reducing sensitive attribute inference while maintaining diagnostic accuracy. A novel TimeAwareMINE estimator addresses temporal-static misalignment in sequential speech.

@HarshalsinghCN: I built an open-source Hinglish TTS that beats every model on the market. I had zero research background. last week I w…

X AI KOLs Timeline · 2026-05-12

A developer documents building an open-source Hinglish text-to-speech system that outperforms existing models by fixing upstream inference bugs and adding a lightweight preprocessing wrapper, achieving high quality without training or GPU resources.

@QingQ77: Training a 0.1B end-to-end omnimodal model from scratch. A single set of weights handles text, speech, and image inputs, while outputting text and streaming speech. https://github.com/jingyaogong/minimind-o… MiniMind-O is an omnimodal model with only 0.1B parameters…

X AI KOLs Timeline · 2026-05-09

The MiniMind-O project has released an end-to-end omnimodal model with only 0.1B parameters that supports text, speech, and image inputs as well as streaming speech output. The project open-sources the code, weights, training data, and technical report, and emphasizes that both training and inference run quickly on standard GPUs.

MultiLinguahah: A New Unsupervised Multilingual Acoustic Laughter Segmentation Method

arXiv cs.CL · 2026-05-08

This paper introduces MultiLinguahah, an unsupervised multilingual method for acoustic laughter segmentation using Isolation Forests on BYOL-A encoder representations. The authors demonstrate that their approach outperforms state-of-the-art supervised methods in non-English settings by treating laughter detection as an anomaly detection task.

PersonaKit (PK): A Plug-and-Play Platform for User Testing Diverse Roles in Full-Duplex Dialogue

arXiv cs.CL · 2026-05-08

PersonaKit is an open-source web platform designed for rapid prototyping and user testing of diverse personas in full-duplex dialogue systems. It allows researchers to configure persona-specific turn-taking behaviors via JSON and conduct A/B surveys to evaluate sociolinguistic interactions.

Aligning Backchannel and Dialogue Context Representations via Contrastive LLM Fine-Tuning

arXiv cs.CL · 2026-04-21

Researchers from KTH Royal Institute of Technology propose a two-stage framework that fine-tunes LLMs on dialogue transcripts and uses contrastive learning to create joint embeddings for aligning backchannel signals with conversational context, demonstrating improved context-backchannel retrieval compared to previous methods.

easyaligner: Forced alignment with GPU acceleration and flexible text normalization (compatible with all w2v2 models on HF Hub) [P]

Reddit r/MachineLearning · 2026-04-18

easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.
