asr

#asr

Phonetic Modeling of Dialectal Variation in Vietnamese Speech

arXiv cs.CL · 2026-05-26

This paper proposes a dialect-aware phonetic framework for modeling phonetic variation in Vietnamese ASR, decomposing syllables into structured components and mapping them to dialect-specific IPA representations. The approach matches pretrained baselines with fewer parameters and no external pretraining on the UIT-ViMD multi-dialect dataset.

#asr

@MaxForAI: If you are working on voice agents, you should try this project. A team from NTU, NUS, and Shanghai AI Lab released: Mega-ASR. This fully open-source ASR is built on Qwen3-ASR, aiming to break the long-standing bottleneck of ASR performance in noisy, reverberant, or other impaired real-world environments...

X AI KOLs Timeline · 2026-05-22

NTU, NUS, and Shanghai AI Lab jointly released Mega-ASR, a fully open-source ASR model built on Qwen3-ASR. Using the Voices-in-the-Wild-2M dataset and progressive acoustic-to-semantic optimization, it achieves up to 30% relative Word Error Rate (WER) reduction in real-world noisy environments. With only 1.7B parameters, it enables efficient inference on consumer-grade hardware.

#asr

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Hugging Face Daily Papers · 2026-05-22

This paper introduces CLD, a lightweight convex optimization-based language detection head for ASR that achieves 97-98% accuracy with under 100 training samples while reducing compute costs by 13x, addressing accent and dialect robustness across 5 languages and 24 sub-dialects.

#asr

StepAudio 2.5 Technical Report

Hugging Face Daily Papers · 2026-05-22

StepAudio 2.5 is a unified audio-language model that achieves state-of-the-art results across ASR, TTS, and real-time spoken interaction by leveraging task-tailored reinforcement learning from human feedback to optimize shared representations.

#asr

@AdinaYakup: Mega-ASR https://huggingface.co/zhifeixie/Mega-ASR… 1.7B Apache 2.0 Built for Noise/Reverb/Clipping/Overlapping speaker…

X AI KOLs Following · 2026-05-21

Mega-ASR is a 1.7B parameter robust ASR model under Apache 2.0, designed for noisy, reverberant, and overlapping speech, with an audio quality router to handle clean vs degraded audio.

#asr

SCRIBE: Diagnostic Evaluation and Rich Transcription Models for Indic ASR

arXiv cs.CL · 2026-05-21

SCRIBE is a diagnostic evaluation framework for automatic speech recognition that provides categorical error decomposition for Indic languages, releasing benchmarks and open-weight rich transcription models for Hindi, Malayalam, and Kannada.

#asr

@XieZhifei14110: Stop using Whisper for ASR ! open sourcing Mega-ASR — the first full-scenario SOTA industrial-grade ASR model, built fo…

X AI KOLs Timeline · 2026-05-20

The author announces the open-sourcing of Mega-ASR, a full-scenario industrial-grade ASR model designed for challenging audio conditions such as far-field and noisy speech. The post claims state-of-the-art results, outperforming existing open and closed models by 10-30% on real-world benchmarks.

#asr

@gkxspace: I spend two to three thousand on AI subscriptions every month, some for TTS, ASR, etc. The mainstream ones are expensive and their API protocols differ. I kept thinking: is there a single plan that covers voice cloning, meeting transcription, AI podcast generation, real-time voice Q&A, voice input, and coding? Finally found a godsend—StepFun's S...

X AI KOLs Timeline · 2026-05-20

StepFun launches the Step Plan subscription at $6.99/month, integrating LLM, TTS, ASR, image generation, and other AI models. It supports direct connection through the OpenAI SDK and covers use cases such as voice cloning, meeting transcription, and AI podcast generation.

#asr

Benchmarking Commercial ASR Systems on Code-Switching Speech: Arabic, Persian, and German

arXiv cs.CL · 2026-05-20

This paper presents a benchmark evaluating five commercial ASR systems on code-switching speech across Arabic-English, Persian-English, and German-English pairs. A two-stage pipeline selects 300 samples per pair, and performance is assessed with WER and BERTScore. ElevenLabs Scribe v2 achieves the lowest overall WER (13.2%) and the highest BERTScore (0.936). The dataset is publicly available.
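
For readers unfamiliar with the WER metric cited in this entry, the following is a minimal sketch of how word error rate is commonly computed: word-level edit distance divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions for illustration; the paper's own text normalization and scoring pipeline are not described in this summary and may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: tokenization is plain whitespace splitting,
    with no casing or punctuation normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER computed this way can exceed 1.0 when the hypothesis is much longer than the reference.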

#asr

Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation

Hugging Face Daily Papers · 2026-05-19

Mega-ASR proposes scaling up real-world acoustic simulation to improve automatic speech recognition in challenging, wild conditions, aiming to narrow the performance gap between lab and real-world settings.

#asr

nvidia/nemotron-3.5-asr-streaming-0.6b

Hugging Face Models Trending · 2026-05-15

NVIDIA releases Nemotron 3.5 ASR, a 600M parameter multilingual streaming speech recognition model supporting 40 language-locales with a Cache-Aware FastConformer-RNNT architecture for low-latency transcription. The model supports configurable chunk sizes and is ready for commercial use under the OpenMDW-1.1 license.

#asr

@FeitengLi: Actually, these problems can be well solved: 1. Ditch whisper, switch to an ASR model. Qwen3-ASR is great with few hallucinations, and there are other ASR options. Whisper has many hallucinations and requires 30s segments. Qwen3-ASR gets more accurate with longer audio, supporting up to 20…

X AI KOLs Timeline · 2026-05-15

The author recommends using Qwen3-ASR instead of Whisper to reduce hallucinations and using LattifAI tools for precise audio-text alignment and subtitle generation, and introduces their own OmniVAD-Kit project for voice activity detection.

#asr

@aigclink: An open-source end-to-end video translation + video Q&A Skill: violin. The highlight is not just literal translation, but the idea of content re-creation. It integrates ASR, LLM translation, and TTS into a seamless pipeline video Skill. The three modules are automatically chained: input a video and get a dubbed translated video. Translation style is adjustable, for example...

X AI KOLs Timeline · 2026-05-15

Violin is an open-source end-to-end video translation and video Q&A tool, integrating ASR, LLM translation, and TTS. It supports style adjustment and content re-creation, and can answer questions about video content.

#asr

A Calculus-Based Framework for Determining Vocabulary Size in End-to-End ASR

arXiv cs.CL · 2026-05-15

This paper presents a calculus-based framework that uses first- and second-derivative tests to estimate the optimal vocabulary size hyperparameter for end-to-end ASR systems, improving performance on the LibriSpeech corpus.

#asr

@berryxia: Guys, this is awesome! Install it right away! Kevin Lin, postdoc at Oxford, former Meta and Microsoft researcher, just released Violin, an open-source video translation Skill. Video is already the absolute dominant content form on the internet. Yet most high-quality lectures, speeches, and podcasts are locked by a single language…

X AI KOLs Timeline · 2026-05-15

Violin is an open-source video translation tool that integrates speech recognition, large language model translation, and text-to-speech. It supports over 30 languages and offers three usage modes: CLI, web app, and Claude Code.

#asr

Vividh-ASR: A Complexity-Tiered Benchmark and Optimization Dynamics for Robust Indic Speech Recognition

Hugging Face Daily Papers · 2026-05-13

This paper introduces Vividh-ASR, a complexity-tiered benchmark for Hindi and Malayalam ASR, identifies a studio bias in fine-tuning, and proposes R-MFT to improve performance on spontaneous speech efficiently.

#asr

Dolphin-CN-Dialect: Where Chinese Dialects Matter

arXiv cs.CL · 2026-05-12

Dolphin-CN-Dialect is a streaming-capable ASR model that improves dialect recognition through temperature-based sampling and redesigned tokenization, achieving competitive performance with a smaller model size.
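
As background on the temperature-based sampling mentioned in this entry, here is a minimal sketch of one widely used formulation, in which each group's sampling probability is proportional to its data share raised to the power 1/T. This is an assumption for illustration: the summary does not state the exact formula the Dolphin-CN-Dialect authors use, and the function name, dialect labels, and hour counts below are hypothetical.

```python
def temperature_sampling_probs(hours: dict[str, float], temperature: float) -> dict[str, float]:
    """Return per-group sampling probabilities p_i proportional to (n_i / N) ** (1 / T).

    T = 1 reproduces the natural data distribution; larger T flattens it,
    so low-resource groups are sampled more often during training.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    total = sum(hours.values())
    if total <= 0:
        raise ValueError("total amount of data must be positive")

    scaled = {name: (amount / total) ** (1.0 / temperature) for name, amount in hours.items()}
    norm = sum(scaled.values())
    return {name: value / norm for name, value in scaled.items()}


if __name__ == "__main__":
    # Hypothetical hour counts, not taken from the paper.
    data_hours = {"mandarin": 900.0, "cantonese": 100.0}
    for t in (1.0, 2.0, 5.0):
        print(t, temperature_sampling_probs(data_hours, t))
```

With a 9:1 split, T = 2 moves the minority group's sampling probability from 0.10 to 0.25, which illustrates how the temperature trades fidelity to the data distribution for coverage of smaller dialects.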

#asr

Adding Benchmaxxer Repellant to the Open ASR Leaderboard

Hugging Face Blog · 2026-05-06

Hugging Face announces the addition of private, high-quality datasets from Appen and DataoceanAI to the Open ASR Leaderboard to prevent benchmaxxing and test-set contamination, while maintaining public data for the default average WER calculation.

#asr

Voice of India: A Large-Scale Benchmark for Real-World Speech Recognition in India

arXiv cs.CL · 2026-04-22

Researchers introduce Voice of India, a 536-hour closed benchmark of unscripted telephonic conversations across 15 Indian languages and 139 regional clusters, exposing geographic and demographic ASR performance disparities.

#asr

@aigclink: Alibaba Tongyi Lab just dropped Fun-ASR 1.5—one industrial-grade model handles 30 languages, all 7 major Chinese dialect families + 20+ regional accents, even classical-poetry recitation. Dialect CER down 56.2 % vs last gen; 5 dialects top 90 % accuracy…

X AI KOLs Timeline · 2026-04-20

Alibaba Tongyi Lab releases Fun-ASR 1.5, a single model covering 30 languages, seven Chinese dialect groups, and 20+ local accents. Character error rate (CER) in key dialect scenarios falls 56.2% relative to the previous generation, and five dialects exceed 90% accuracy.
