speech-recognition

#speech-recognition

Compressed Whisper large-v3-turbo to 368 MB with Q3_K-matched QAT — multilingual WER results

Reddit r/openclaw ↗ · 21h ago

Whisper large-v3-turbo has been compressed to 368 MB using Q3_K-matched quantization-aware training, with multilingual word error rate results reported.
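
For readers unfamiliar with the metric, word error rate is commonly computed as the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation code behind the post; the function name `word_error_rate` is hypothetical, and it assumes plain whitespace tokenization with no text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization, no casing or punctuation
    normalization (real evaluations usually normalize text first).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.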

#speech-recognition

Whisperian: It is one of the best applications for Android, if you want to use Mic with some local ASR models. And it is also available on Play Store.

Reddit r/LocalLLaMA ↗ · yesterday

Whisperian is an Android application that feeds microphone input to local automatic speech recognition (ASR) models. It is available on the Play Store.

#speech-recognition

Low Resource Multimodal Translation of Nepali Spoken Words into Emotion-Conditioned Sign Language Avatars

arXiv cs.CL ↗ · 3d ago

This paper presents NEST-V1, a proof-of-concept multimodal framework for generating emotion-conditioned Nepali Sign Language avatars from spoken input, achieving 81.1% ASR accuracy and 79.21% emotion recognition accuracy on a dataset of 600 audio samples from 50 speakers.

#speech-recognition

Error-Aware TF-IDF Retrieval-Augmented Generation for ASR Error Correction

arXiv cs.CL ↗ · 4d ago

This paper proposes an error-aware TF-IDF retrieval-augmented generation framework for correcting automatic speech recognition errors, achieving significant accuracy gains on Persian FLEURS with near-zero inference latency.

#speech-recognition

Data Scale, Not Latency, Shapes Cross-Lingual Encoder Transfer in Streaming ASR

arXiv cs.AI ↗ · 5d ago

This paper investigates the impact of data scale versus latency on cross-lingual transfer for streaming ASR, finding that multilingual initialization benefits are data-limited, not latency-limited, and diminish as target-language data increases.

#speech-recognition

Layer-wise Probing of wav2vec 2.0 and Whisper for Consonant Cluster Reduction in African American English

arXiv cs.CL ↗ · 5d ago

This paper uses layer-wise probing to investigate how wav2vec 2.0 and Whisper encode consonant cluster reduction in African American English, finding that both models distinguish reduced and canonical forms and preserve cues to underlying stops.

#speech-recognition

Introducing the FFASR Leaderboard: Benchmarking ASR in the Real World

Hugging Face Blog ↗ · 5d ago

Introduces the FFASR Leaderboard, an open, community-driven benchmark for evaluating automatic speech recognition models under realistic far-field acoustic conditions, highlighting the significant performance gap between near-field and far-field scenarios.

#speech-recognition

@DataChaz: @NVIDIA just quietly dropped an incredibly impressive speech recognition model that completely changes the math for loc…

X AI KOLs Timeline ↗ · 6d ago

NVIDIA quietly released Nemotron-3.5-ASR, a lightweight 0.6B-parameter open-source speech recognition model designed for low-latency real-time streaming, with support for 40+ languages and a cache-aware architecture.

#speech-recognition

@FeitengLi: Led by Fable 5 (just half a day), Codex relay development took a week. #EdgeSpeak is now live. Friends who shared, contact me to receive an invite code https://edgespeak.com/zh

X AI KOLs Timeline ↗ · 2026-06-21

EdgeSpeak, a desktop voice transcription tool, is now live and features the local Lattice-2 voice model. It supports offline audio and video transcription across multiple languages and accents, and provides a local API that developers can integrate with.

#speech-recognition

@AndrewYNg: New course: Add voice to your AI agents and applications, built with @VocalBridge (disclosure: an AI Fund portfolio com…

X AI KOLs Following ↗ · 2026-06-18

Andrew Ng announces a new course on adding voice to AI agents using VocalBridge, taught by its CEO. The course covers three integration patterns and evaluation techniques for building reliable and low-latency voice applications.

#speech-recognition

ASTRA: A Scalable Next-Generation ATCO Training Simulator with Autonomous Simpilots

arXiv cs.LG ↗ · 2026-06-18

ASTRA is an end-to-end training simulator for air traffic control operators that automates sim pilot roles using locally adapted speech models, achieving a significant reduction in word error rates for Singaporean-accented aviation speech and incorporating AI-assisted performance evaluation.

#speech-recognition

@mudler_it: parakeet.cpp now runs NVIDIA Parakeet behind the OpenAI API. Point any OpenAI client at a local server, send an audio, …

X AI KOLs Timeline ↗ · 2026-06-17

parakeet.cpp enables running NVIDIA Parakeet ASR behind the OpenAI API locally with prebuilt Docker images, supporting CPU and CUDA (including arm64) for real-time transcription with word timestamps.
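
To make "behind the OpenAI API" concrete: OpenAI's transcription endpoint is a multipart POST to `/v1/audio/transcriptions` with `file` and `model` form fields. The sketch below builds such a request with the Python standard library without sending it. It assumes the local server mirrors that path; the base URL, port, default model name `parakeet`, and the helper name `build_transcription_request` are all hypothetical and not taken from the parakeet.cpp project.

```python
import urllib.request
import uuid


def build_transcription_request(
    base_url: str,
    audio_bytes: bytes,
    filename: str = "sample.wav",
    model: str = "parakeet",  # hypothetical model name, check the server docs
) -> urllib.request.Request:
    """Build (but do not send) an OpenAI-style transcription request."""
    boundary = uuid.uuid4().hex
    # Hand-rolled multipart/form-data body with the two fields the
    # OpenAI transcription endpoint requires: "model" and "file".
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        b'Content-Disposition: form-data; name="model"\r\n\r\n',
        model.encode() + b"\r\n",
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes + b"\r\n",
        f"--{boundary}--\r\n".encode(),
    ])
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )


if __name__ == "__main__":
    # Assumed local address; adjust to wherever the server listens.
    req = build_transcription_request("http://localhost:8080", b"RIFF")
    print(req.get_method(), req.full_url)
```

Sending the request would be a matter of passing it to `urllib.request.urlopen` once a server is actually listening; an existing OpenAI client library pointed at the same base URL should produce an equivalent request.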

#speech-recognition

Improving low-resource ASR using bilingual fine-tuning with language identification: a cross-linguistic evaluation

arXiv cs.CL ↗ · 2026-06-17

This study evaluates bilingual fine-tuning with language identification tokens for improving ASR in low-resource languages across nine diverse language pairs, finding that high LID accuracy is beneficial and that providing the LID token at inference can boost performance when LID accuracy is low.

#speech-recognition

Voice agents in noisy environments

Reddit r/AI_Agents ↗ · 2026-06-16

A speech company trained a model that cancels noise and isolates the primary speaker, reportedly achieving a 50% lower word error rate when used with leading ASR models in noisy environments.

#speech-recognition

MoDiCoL: A Modular Diagnostic Continual Learning Dataset for Robust Speech Recognition

arXiv cs.CL ↗ · 2026-06-15

This paper introduces MoDiCoL, a modular diagnostic continual learning dataset for robust speech recognition, enabling controlled analysis of linguistic content, speaker characteristics, and acoustic environments, and proposes a continual learning curriculum to study how robustness is acquired, transferred, and forgotten.

#speech-recognition

Learning to Hear Hesitation: Continual Learning for Disfluency-Aware ASR

arXiv cs.CL ↗ · 2026-06-15

This paper proposes a continual learning approach to integrate disfluency tokens into pretrained ASR models, addressing catastrophic forgetting and improving recognition of disfluent speech.

#speech-recognition

@PyTorch: In this clip from his PyTorch Conference Europe 2026 keynote, Patrick von Platen (@MistralAI) discusses why real-world …

X AI KOLs Following ↗ · 2026-06-12

At PyTorch Conference Europe 2026, Mistral AI's Patrick von Platen explains why real-world AI interaction requires streaming architectures that process continuous input and produce continuous output, using Vox Real Time as a live transcription example.

#speech-recognition

@tom_doerr: Transcribes audio at 70x real-time speed https://github.com/m-bain/whisperX

X AI KOLs Timeline ↗ · 2026-06-12

WhisperX is a tool for fast automatic speech recognition with word-level timestamps and speaker diarization, offering 70x realtime transcription using Whisper large-v2.

#speech-recognition

Revi

Product Hunt ↗ · 2026-06-12

Revi is a voice dictation app that runs on-device without needing cloud services or an account.

#speech-recognition

How I implemented ASR bias for voice transcription models [Open Source]

Reddit r/LocalLLaMA ↗ · 2026-06-11

The post explains how to implement ASR biasing for voice transcription models, with examples using Groq and local models, and introduces Freestyle, an open-source project that incorporates the feature.
