Tag
A survey of automated presentation coaching that reviews existing systems, introduces a five-dimensional task taxonomy covering pronunciation, stress, prosody, pacing, and content faithfulness, and identifies open challenges such as annotation scarcity, accent fairness, and low-latency feedback.
The January 6, 2026 draft of the 3rd edition of 'Speech and Language Processing' by Dan Jurafsky and James H. Martin has been released; it features a revised structure with a focus on large language models and updated chapters.
A C++ implementation of distilHuBERT with no runtime dependencies, compiled-in weights, dynamic sizing, and performance on par with ONNX Runtime, designed for easy integration into CMake projects.
This paper introduces a semantic-timescale analysis pipeline to study how generic vs. specific content is distributed over time in human and AI-generated speech, revealing that autocorrelation-window measures capture temporal organization of semantics beyond static lexical distributions.
A novel method for multilingual word-level forced alignment combines self-supervised representations from MMS and a phoneme boundary detector with a learned dynamic programming decoder, outperforming existing aligners on English and unseen languages without further training.
InfoShield introduces a privacy-preserving method for speech representations in mental health screening using information-theoretic optimization, reducing sensitive attribute inference while maintaining diagnostic accuracy. A novel TimeAwareMINE estimator addresses temporal-static misalignment in sequential speech.
A developer documents building an open-source Hinglish text-to-speech system that outperforms existing models by fixing upstream inference bugs and adding a lightweight preprocessing wrapper, achieving high quality without training or GPU resources.
MiniMind-O has released an end-to-end omnimodal model with only 0.1B parameters, supporting text, speech, and image inputs as well as streaming speech output. The project open-sources the code, weights, training data, and technical report, emphasizing that both training and inference can be performed quickly on standard GPUs.
This paper introduces MultiLinguahah, an unsupervised multilingual method for acoustic laughter segmentation using Isolation Forests on BYOL-A encoder representations. The authors demonstrate that their approach outperforms state-of-the-art supervised methods in non-English settings by treating laughter detection as an anomaly detection task.
PersonaKit is an open-source web platform designed for rapid prototyping and user testing of diverse personas in full-duplex dialogue systems. It allows researchers to configure persona-specific turn-taking behaviors via JSON and conduct A/B surveys to evaluate sociolinguistic interactions.
Researchers from KTH Royal Institute of Technology propose a two-stage framework that fine-tunes LLMs on dialogue transcripts and uses contrastive learning to create joint embeddings for aligning backchannel signals with conversational context, demonstrating improved context-backchannel retrieval compared to previous methods.
easyaligner is an open-source forced alignment library with GPU acceleration and flexible text normalization that works with all wav2vec2 models on Hugging Face Hub. It addresses practical workflows like handling partial transcripts, irrelevant speech segments, and long audio without chunking while preserving original text formatting.