Speech-Driven End-to-End Language Discrimination towards Chinese Dialects

arXiv cs.CL Papers

Summary

This paper studies speech-driven features for fine-grained discrimination among Chinese dialects. It combines MFCC-based acoustic features with embeddings of attention-selected dialect words in a CNN, and reports that the approach is effective compared with state-of-the-art methods on two benchmark Chinese dialect corpora.

arXiv:2606.18584v1 Announce Type: new

Abstract: Language discrimination among similar languages, varieties, and dialects is a challenging natural language processing task. The traditional text-driven focus leads to poor results. In this paper, we explore the effectiveness of speech-driven features towards language discrimination among Chinese dialects. First, we systematically explore the appropriateness of speech-driven MFCC features towards CNN-based language discrimination. Then, we design an end-to-end speech recognition model based on HMM-DNN to predict Chinese dialect words. We adopt attention to extract the discriminative words related to different Chinese dialects. Finally, through a CNN, we combine the word-level embedding and the MFCC-based features. Evaluation of two benchmark Chinese dialect corpora shows the appropriateness and effectiveness of the proposed speech-driven approach to fine-grained Chinese dialect discrimination compared to the state-of-the-art methods.
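
The abstract gives no architectural details, so the following is only a minimal NumPy sketch of one way the final fusion step could look, not the authors' implementation. It assumes MFCC frames and word embeddings are already computed, uses random untrained weights, and fuses the two branches by simple concatenation. All names (`conv1d_relu_maxpool`, `attention_pool`, `classify`) and all sizes (13 MFCC coefficients, 8 filters, 16-dimensional embeddings, 4 dialect classes) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    # Numerically stable softmax over a 1-D vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def conv1d_relu_maxpool(frames, kernels, bias):
    """Acoustic branch: 1-D convolution over time, ReLU, global max-pooling.

    frames:  (T, n_mfcc) MFCC matrix for one utterance
    kernels: (n_filters, width, n_mfcc)
    """
    width = kernels.shape[1]
    n_windows = frames.shape[0] - width + 1
    # Sliding windows over time: (n_windows, width, n_mfcc)
    windows = np.stack([frames[t:t + width] for t in range(n_windows)])
    feature_maps = np.einsum("twc,fwc->tf", windows, kernels) + bias
    return np.maximum(feature_maps, 0.0).max(axis=0)  # (n_filters,)

def attention_pool(word_embeddings, query):
    """Lexical branch: dot-product attention over recognized words.

    word_embeddings: (n_words, d); query: (d,) (would be learned in practice)
    """
    scores = word_embeddings @ query / np.sqrt(word_embeddings.shape[1])
    weights = softmax(scores)          # one weight per word, sums to 1
    return weights @ word_embeddings, weights

def classify(mfcc_frames, word_embeddings, params):
    acoustic = conv1d_relu_maxpool(mfcc_frames, params["kernels"], params["conv_bias"])
    lexical, weights = attention_pool(word_embeddings, params["query"])
    fused = np.concatenate([acoustic, lexical])  # assumed fusion: concatenation
    logits = params["W_out"] @ fused + params["b_out"]
    return softmax(logits), weights

# Hypothetical sizes, chosen only for illustration.
n_mfcc, width, n_filters, d, n_dialects = 13, 5, 8, 16, 4
params = {
    "kernels": rng.normal(scale=0.1, size=(n_filters, width, n_mfcc)),
    "conv_bias": np.zeros(n_filters),
    "query": rng.normal(size=d),
    "W_out": rng.normal(scale=0.1, size=(n_dialects, n_filters + d)),
    "b_out": np.zeros(n_dialects),
}

mfcc_frames = rng.normal(size=(100, n_mfcc))   # stand-in for real MFCC features
word_embeddings = rng.normal(size=(6, d))      # stand-in for recognized-word embeddings
probs, weights = classify(mfcc_frames, word_embeddings, params)
```

Here `probs` holds one probability per assumed dialect class and `weights` shows how strongly each recognized word contributes to the lexical vector. In a trained model, the largest weights would be expected on the words that best separate the dialects.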

# Speech-Driven End-to-End Language Discrimination towards Chinese Dialects
Source: [https://arxiv.org/abs/2606.18584](https://arxiv.org/abs/2606.18584)
[View PDF](https://arxiv.org/pdf/2606.18584)


## Submission history

From: Fan Xu
**[v1]** Wed, 17 Jun 2026 01:23:58 UTC (1,045 KB)
