Phonetic and semantic analyses of spoken corpora of Beijing and Taiwan Mandarin indicate that the neutral tone is a lexical tone

arXiv cs.CL 06/26/26, 04:00 AM Papers
linguistics mandarin-chinese neutral-tone lexical-tone phonetic-analysis corpus-linguistics gam
Summary
This paper presents a corpus-based study showing that the neutral tone in Mandarin Chinese is a lexical tone with its own tonal target, based on phonetic and semantic analyses of Beijing and Taiwan Mandarin spoken corpora using generalized additive models and contextualized embeddings.
arXiv:2606.26360v1 Announce Type: new Abstract: The neutral, or floating, tone of Mandarin Chinese is a tone with an enigmatic set of properties. It has been described as a reduced tone, or as a tone that sometimes is lexically fixed but that can also be toneless. In two-syllable words, it is found only on the second syllable, but single-syllable words can also have the neutral tone. We present a corpus-based study of the phonetic realization of the neutral tone in spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin. We show that the neutral tone has its own tonal target, just as the four lexical tones of Mandarin. We also show that disyllabic words with a neutral tone have pitch contours that have a pitch component that depends on the tone on the first syllable, just as has been observed for two-syllable words with a lexical tone on the second syllable (Chuang et al., 2026). Furthermore, words with a floating tone have word-specific pitch signatures, which have also been documented for single-syllable words (Jin et al., 2026) as well as two-syllable words (Lu et al., 2026b). These word-specific pitch signatures are shown to be predictable to some extent from words' contextualized embeddings, as previously reported for lexical tones (Chuang et al., 2026; Lu et al., 2026b). As there is also considerable variability in the realization of lexical tones, we propose that the neutral tone is, in fact, a lexical tone in both Taiwan Mandarin and Beijing Mandarin. We document both similarities and differences in the realization of the floating tone in these two varieties and provide evidence, using contextualized embeddings, that some of the observed differences may arise from differences in the meanings of the words as used in the two corpora.
# Phonetic and semantic analyses of spoken corpora of Beijing and Taiwan Mandarin indicate that the neutral tone is a lexical tone
Source: [https://arxiv.org/html/2606.26360](https://arxiv.org/html/2606.26360)
Yuxin Lu1, Zhexuan Li2, R\. Harald Baayen3

1Quantitative Linguistics, Eberhard Karls Universität Tübingen, Tübingen, Germany Email: yuxin\.lu@uni\-tuebingen\.de 2Quantitative Linguistics, Eberhard Karls Universität Tübingen, Tübingen, Germany Email: zhexuan\.li@student\.uni\-tuebingen\.de 3Quantitative Linguistics, Eberhard Karls Universität Tübingen, Tübingen, Germany Email: harald\.baayen@uni\-tuebingen\.de

\(June, 2026\)

###### Abstract

The neutral, or floating, tone of Mandarin Chinese is a tone with an enigmatic set of properties\. It has been described as a reduced tone, or as a tone that sometimes is lexically fixed, but that can also be toneless\. In two\-syllable words, it is found only on the second syllable, but single\-syllable words can also have the neutral tone\. We present a corpus\-based study of the phonetic realization of the neutral tone in spontaneous conversational speech corpora of Beijing Mandarin and Taiwan Mandarin\. We show that the neutral tone has its own tonal target, just as the four lexical tones of Mandarin\. We also show that disyllabic words with a neutral tone have pitch contours that have a pitch component that depends on the tone on the first syllable, just as has been observed for two\-syllable words with a lexical tone on the second syllable \[[11](https://arxiv.org/html/2606.26360#bib.bib14)\]\. Furthermore, words with a floating tone have word\-specific pitch signatures, which have also been documented for single\-syllable words \[[25](https://arxiv.org/html/2606.26360#bib.bib13)\] as well as two\-syllable words \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. These word\-specific pitch signatures are shown to be predictable to some extent from words’ contextualized embeddings, as previously reported for lexical tones \[[11](https://arxiv.org/html/2606.26360#bib.bib14),[38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. As there is also considerable variability in the realization of lexical tones, we propose that the neutral tone is, in fact, a lexical tone in both Taiwan Mandarin and Beijing Mandarin\. We document both similarities and differences in the realization of the floating tone in these two varieties and provide evidence, using contextualized embeddings, that some of the observed differences may arise from differences in the meanings of the words as used in the two corpora\.

Keywords: neutral tone, floating tone, GAM, Beijing Mandarin, Taiwan Mandarin, word\-specific pitch signatures, contextualized embeddings

## 1 Introduction

The tone system of standard Mandarin Chinese comprises four “lexical” tones: a high level tone \(T1\), a rising tone \(T2\), a low dipping tone \(T3\), and a falling tone \(T4\)\. In addition, there is a fifth tone, the neutral tone \(T5\)\. The neutral tone is reported to approach a mid\-low pitch target, with a pitch contour that is largely determined by the preceding lexical tone\[[6](https://arxiv.org/html/2606.26360#bib.bib3)\]\. The phonetic realization of the neutral tone varies across regions, not only with respect to the details of how pitch is modulated, but also with respect to duration and intensity\[[33](https://arxiv.org/html/2606.26360#bib.bib42)\]\.

Recent studies \[[11](https://arxiv.org/html/2606.26360#bib.bib14),[25](https://arxiv.org/html/2606.26360#bib.bib13),[37](https://arxiv.org/html/2606.26360#bib.bib12),[38](https://arxiv.org/html/2606.26360#bib.bib11)\] have documented for unscripted conversational Taiwan Mandarin that words’ pitch contours have a word\-specific component that is tied to their meanings, independently of words’ canonical tones and a wide range of prosodic factors\. These studies used the generalized additive model \(GAM\) \[[51](https://arxiv.org/html/2606.26360#bib.bib7)\] to decompose empirical pitch contours into additive pitch contour components, each of which is tied to a linguistic predictor \(e\.g\., the preceding tone, position in the sentence, or the tone of the following word\)\. GAMs reveal strong evidence both for components tied to tone patterns and for word\-specific pitch contour components\.
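The idea of an additive decomposition can be illustrated with a deliberately simplified sketch. The code below is not a GAM and is not the procedure used in the cited studies: it replaces penalized smooths with point-wise means, ignores all prosodic covariates, and assumes that every contour has already been sampled on the same normalized time grid. The function names, the dictionary keys, and the toy values are hypothetical.

```python
from collections import defaultdict
from statistics import fmean


def pointwise_mean(contours):
    """Average equal-length contours point by point."""
    return [fmean(values) for values in zip(*contours)]


def decompose(tokens):
    """Split contours into a tone-pattern component and word-specific deviations.

    Each token is a dict with the (hypothetical) keys 'word', 'pattern', and
    'f0', where 'f0' is a list of pitch values on a shared normalized time grid.
    """
    # Component shared by all tokens with the same tone pattern.
    by_pattern = defaultdict(list)
    for tok in tokens:
        by_pattern[tok["pattern"]].append(tok["f0"])
    pattern_component = {p: pointwise_mean(cs) for p, cs in by_pattern.items()}

    # Word-specific component: mean deviation of a word's tokens from
    # the component of its tone pattern.
    by_word = defaultdict(list)
    for tok in tokens:
        shared = pattern_component[tok["pattern"]]
        by_word[tok["word"]].append([y - s for y, s in zip(tok["f0"], shared)])
    word_component = {w: pointwise_mean(rs) for w, rs in by_word.items()}
    return pattern_component, word_component


# Toy data (invented values): two word types sharing the T4-T5 pattern.
tokens = [
    {"word": "word_a", "pattern": "T4-T5", "f0": [5.0, 4.0, 3.0]},
    {"word": "word_b", "pattern": "T4-T5", "f0": [3.0, 2.0, 1.0]},
]
pattern_component, word_component = decompose(tokens)
```

In this toy case the shared T4-T5 component is the falling contour midway between the two words, and each word's component is a constant offset above or below it. A GAM estimates comparable components jointly, with smoothness penalties and alongside the other predictors.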

\[[38](https://arxiv.org/html/2606.26360#bib.bib11)\] investigated the pitch realization of disyllabic words that, as a set, exemplify all 20 possible tone patterns \(four tones, T1–T4, for the first syllable, and five tones, T1–T5, for the second syllable\)\. In their analysis, the neutral tone emerged as a tone in its own right, along with the four lexical tones \(T1–T4\)\. Given that the floating tone is described as highly dependent on context, this result is surprising\. The present study therefore examines in further detail the pitch realization of disyllabic words in spontaneous speech, in which the first syllable bears a lexical tone and the second syllable carries a neutral tone\. We address the following questions\.

1. What is the nature of the neutral tone? Do words with a specific initial lexical tone and a final neutral tone have a distinct pitch contour component that differs from the pitch contour components of words with other initial lexical tones? This is what \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\] observed for all 20 tone patterns, four of which had a neutral tone on the second syllable\. If this result is replicated for Taiwan Mandarin, and also replicates for another variety of Mandarin, namely, Beijing Mandarin, this would imply that the realization of the neutral tone depends on the preceding tone in its own way\. However, dependence on the preceding tone is also observed for final lexical tones in two\-syllable compounds \(cf\. \[[55](https://arxiv.org/html/2606.26360#bib.bib15)\] for laboratory speech\)\. Therefore, if evidence for tone\-pattern\-specific pitch components for tone patterns with a final floating tone replicates, this would argue in favor of the floating tone being no different from a lexical tone\.
2. Considerable differences have been reported for the realization of the floating tone in Taiwan Mandarin and Beijing Mandarin\. Beijing Mandarin, for instance, has been reported to have a higher prevalence of neutral tones \[[10](https://arxiv.org/html/2606.26360#bib.bib26),[42](https://arxiv.org/html/2606.26360#bib.bib49)\]\. Are these differences visible as differently\-shaped pitch components for the same tone patterns \(e\.g\., T2\-T5\) in Taiwan conversational speech and Beijing conversational speech? Furthermore, is there more variability in the pitch signatures for tone patterns with the neutral tone in Beijing Mandarin than in Taiwan Mandarin? In addition, it is possible that there is more variability in the realization of disyllabic words with a neutral tone on the second syllable than for disyllabic words with a lexical tone\.
3. Do disyllabic words with a neutral tone exhibit word\-specific pitch contour signatures not only in Taiwan Mandarin, but also in Beijing Mandarin? If so, to what extent are the pitch signatures for a specific word similar across the two dialects? One would expect there to be words with very similar pitch signatures, but for others, their pitch signatures may well be dialect\-specific\.
4. \[[11](https://arxiv.org/html/2606.26360#bib.bib14)\] and \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\] have shown that words’ pitch signatures can be predicted from their meanings \(operationalized with contextualized embeddings\) with above\-chance accuracy, indicating that there is some isomorphy between the phonetic space of pitch contours and the embedding space\. This raises the following question: Are contextualized embeddings for Beijing conversational speech predictive for words’ tonal signatures as previously observed for Taiwan Mandarin? Given the substantial variability and greater prevalence of the neutral tone in Beijing Mandarin, this is not a trivial question\. A related question is whether differences in words’ pitch signatures in Beijing and Taiwan can be traced to differences in words’ meanings in these two dialects, as gauged with contextualized embeddings\.
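To make the notion of above-chance accuracy in question 4 concrete, the following sketch shows one possible way of scoring predicted pitch signatures against observed ones. It is an assumption for illustration only: the evaluation procedure of the cited studies may differ, and the mapping from embeddings to predicted contours is not shown. The function name and the toy vectors are hypothetical.

```python
import math


def identification_accuracy(predicted, observed):
    """Share of words whose predicted contour is nearest to their own observed contour.

    Both arguments map word labels to equal-length lists of pitch values.
    Returns (accuracy, chance level), where chance is 1 / number of words.
    """
    words = list(observed)
    hits = 0
    for word in words:
        # The observed contour closest (Euclidean distance) to this prediction.
        nearest = min(words, key=lambda other: math.dist(predicted[word], observed[other]))
        hits += nearest == word
    return hits / len(words), 1 / len(words)


# Invented two-point "contours" for three hypothetical words.
predicted = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [-1.0, 0.0]}
observed = {"a": [0.9, 0.1], "b": [0.1, 0.9], "c": [-0.8, 0.0]}
accuracy, chance = identification_accuracy(predicted, observed)
```

Accuracy is informative only relative to the chance level, which shrinks as the number of word types grows.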

The remainder of this study is structured as follows\. Section [2](https://arxiv.org/html/2606.26360#S2) provides further background on the neutral tone in Standard Mandarin, Beijing Mandarin, and Taiwan Mandarin\. Section [3](https://arxiv.org/html/2606.26360#S3) introduces the speech corpora for Beijing and Taiwan and the datasets that we created from these corpora\. Section [4](https://arxiv.org/html/2606.26360#S4) presents the statistical analyses of the pitch contours, using generalized additive mixed modeling\. Section [4\.3](https://arxiv.org/html/2606.26360#S4.SS3) reports our investigation of the relation between word\-specific pitch signatures and word\-specific meaning, using contextualized embeddings\. Section [5](https://arxiv.org/html/2606.26360#S5) provides the general discussion\.

## 2 The neutral tone

The neutral tone occurs in unstressed syllables and is widely attested in grammatical morphemes\. These include particles and suffixes \(e\.g\., 了\-le, 个\-ge, 子\-zi\), structural particles \(的\-de\), aspect markers \(着\-zhe\), locative elements \(上\-shang\), directional complements \(来\-lai\), as well as reduplicated forms \(e\.g\., 妈妈ma1ma5, ‘mom’\)\.

The neutral tone, also referred to as the “floating tone”, is traditionally described \[[6](https://arxiv.org/html/2606.26360#bib.bib3)\] as a separate tone category next to the four lexical tones\. It is often analyzed as resulting from the neutralization of tonal contrasts in unstressed positions \[[56](https://arxiv.org/html/2606.26360#bib.bib57)\]\. It is also commonly referred to as qīng shēng \(轻声\) in Chinese, which literally means the “light” or “soft” tone\. Syllables with the neutral tone are typically produced with shorter duration, lower intensity, less articulatory effort, and weakened vowel contrasts \[[3](https://arxiv.org/html/2606.26360#bib.bib20),[8](https://arxiv.org/html/2606.26360#bib.bib24),[52](https://arxiv.org/html/2606.26360#bib.bib8)\]\.

The pitch contours of floating tones have been reported to depend in part on the preceding lexical tone, and tend to converge to a mid or low pitch target by the end of the carrying syllable\[[6](https://arxiv.org/html/2606.26360#bib.bib3),[45](https://arxiv.org/html/2606.26360#bib.bib52),[57](https://arxiv.org/html/2606.26360#bib.bib58)\]\. Although the precise realization of the floating tone varies across studies, the pattern in the literature suggests that when following Tone 1, Tone 2, or Tone 4, the neutral tone tends to be realized with a falling contour or with an overall lower pitch than the offset of the preceding tone\. However, when following Tone 3, it is often realized with a rising contour or with a pitch higher than the preceding offset\[[5](https://arxiv.org/html/2606.26360#bib.bib22),[6](https://arxiv.org/html/2606.26360#bib.bib3),[7](https://arxiv.org/html/2606.26360#bib.bib23),[45](https://arxiv.org/html/2606.26360#bib.bib52),[27](https://arxiv.org/html/2606.26360#bib.bib37),[4](https://arxiv.org/html/2606.26360#bib.bib21),[14](https://arxiv.org/html/2606.26360#bib.bib28),[34](https://arxiv.org/html/2606.26360#bib.bib43),[57](https://arxiv.org/html/2606.26360#bib.bib58),[39](https://arxiv.org/html/2606.26360#bib.bib46)\]\.

The realization of the neutral tone in connected speech is further co\-determined by a wide range of factors, such as intonational context \(e\.g\., declarative versus interrogative\), prosodic structure \(e\.g\., disyllabic versus trisyllabic sequences\), the number of neutral tones within a sequence, and speech style\. While sentence\-final lexical tones in interrogative utterances typically exhibit a rising contour, sentence\-final neutral tones may be realized with a falling contour in questions \[[35](https://arxiv.org/html/2606.26360#bib.bib44)\]\. The realization of the neutral tone varies across disyllabic and trisyllabic sequences, with greater reduction often observed in longer sequences or in later syllables\. Sequences containing multiple neutral tones may show cumulative reduction effects in both duration and pitch \[[53](https://arxiv.org/html/2606.26360#bib.bib2),[52](https://arxiv.org/html/2606.26360#bib.bib8)\]\. In addition, \[[52](https://arxiv.org/html/2606.26360#bib.bib8)\] reported that speech style further conditions neutral tone realization, with more extreme reduction typically observed in casual speech than in more controlled styles such as broadcast news\.

\[[23](https://arxiv.org/html/2606.26360#bib.bib33)\] distinguished between nine different types of neutral tones, such as ‘obligatory neutral tone’ words, ‘habitually neutralized’ forms, ‘optionally’ neutralized words, and ‘contrastive neutral tone’ words\. Contrastive neutral tones are found in the second syllable of certain disyllabic words \(such as 东西 dongxi and 大意 dayi\), which can be realized with different tone patterns depending on their meaning: dong1xi1 ‘east\-west’, dong1xi5 ‘thing’; da4yi4 ‘general idea’, da4yi5 ‘careless’\.

### 2\.1 Standard and Beijing Mandarin

Standard Mandarin is based on Beijing Mandarin, but the two are not identical\. As the official standard language of China, Putonghua takes the pronunciation of Beijing Mandarin as its norm while drawing on northern dialects for its grammatical foundation\. Although the neutral tone is a common feature across northern Mandarin varieties, it is particularly prominent in Beijing Mandarin\[[10](https://arxiv.org/html/2606.26360#bib.bib26),[42](https://arxiv.org/html/2606.26360#bib.bib49)\], where it may also function as a sociolinguistic index of local identity\[[58](https://arxiv.org/html/2606.26360#bib.bib59)\]\. In Standard Chinese, about 15–20% of the syllables in written texts are considered unstressed, including certain suffixes, clitics, and particles\. Beijing Mandarin, however, has a particularly high prevalence of neutral tones\.

According to\[[13](https://arxiv.org/html/2606.26360#bib.bib27)\], the neutral tone should be understood as a flexible tone on a continuum from fully specified lexical tones to toneless syllables, with its actual realization shaped by contextual factors\. Within this continuum, the so\-called “forbidden” neutral tone words are of special interest\. Words with the forbidden neutral tone are often neutralized in Beijing Mandarin, but are prescribed to carry full lexical tones in Standard Mandarin\. As a consequence, the Beijing dialect contains a larger inventory of neutral\-tone words than Standard Mandarin\[[22](https://arxiv.org/html/2606.26360#bib.bib32)\]\.

Phonetically, in Beijing Mandarin, syllables with neutral tones have been found to be more reduced than in many other Mandarin varieties, exhibiting shorter duration and lower intensity \[[36](https://arxiv.org/html/2606.26360#bib.bib45)\]\. As in Standard Mandarin, their F0 realization is highly dependent on the preceding lexical tone, both in contour shape and pitch height \[[28](https://arxiv.org/html/2606.26360#bib.bib36)\]\.

The northern varieties of Mandarin are relatively homogeneous compared to the considerable diversity observed among southern varieties\. While most regional varieties, such as those of Guangzhou, Shanghai, and Yantai, share the same basic tonal categories as Beijing Mandarin, the phonetic realization of these tones, in particular their pitch contours, varies substantially across regions \[[33](https://arxiv.org/html/2606.26360#bib.bib42)\] and serves as an important marker of one’s local dialect\.

The phonetic realization of the neutral tone has been investigated for a range of Mandarin varieties, including Changde Mandarin \[[59](https://arxiv.org/html/2606.26360#bib.bib60)\], Changsha Mandarin \[[54](https://arxiv.org/html/2606.26360#bib.bib56)\], Ürümqi Mandarin \[[20](https://arxiv.org/html/2606.26360#bib.bib30),[50](https://arxiv.org/html/2606.26360#bib.bib55)\], Taiwan Mandarin \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\], Yichang Mandarin \[[31](https://arxiv.org/html/2606.26360#bib.bib40)\], and Tianjin Mandarin \[[30](https://arxiv.org/html/2606.26360#bib.bib39)\]\. \[[32](https://arxiv.org/html/2606.26360#bib.bib41)\] provides a comparative analysis of Chongqing, Kunming, and Nanjing Mandarin\. Differences in neutral tones between Changsha Mandarin and Standard Mandarin \[[53](https://arxiv.org/html/2606.26360#bib.bib2)\] demonstrate variation across social strata and speech levels\. Whereas second syllables of some disyllabic words are also unstressed in Northern Mandarin accents, many Mandarin speakers in Southern China tend to preserve their inherent tones\.

### 2\.2 Taiwan Mandarin

Taiwan Mandarin, which has been influenced by Southern Min, differs from Standard Mandarin in several respects\. One notable characteristic is that the distinction between stressed and unstressed syllables is less perceptually salient \[[26](https://arxiv.org/html/2606.26360#bib.bib35)\]\. \[[29](https://arxiv.org/html/2606.26360#bib.bib38)\] reported that the neutral tone in Taiwan Mandarin exhibits a relatively stable pitch target, and \[[48](https://arxiv.org/html/2606.26360#bib.bib54)\] described the neutral tone in Taiwan Mandarin as resembling a low “entering tone” with short duration\. \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\] reported that in Taiwan Mandarin the neutral tone may either extend the pitch contour of the preceding tone or converge toward a relatively stable pitch target in the mid\-low to low range\. As not all syllables lacking lexical stress are necessarily reduced in the same way, \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\] argues that the neutral tone in Taiwan Mandarin is better analyzed as unaccented rather than strictly unstressed\.

The use of the neutral tone is reported to be less frequent in Taiwan Mandarin than in Standard Mandarin and Beijing Mandarin\. \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\] compared the Putonghua Shuiping Ceshi Qingsheng Cibiao \(Standard Mandarin Proficiency Test wordlist\) with the Revised Mandarin Chinese Dictionary published by Taiwan’s Ministry of Education and observed that while suffixes and reduplicated syllables are consistently marked as having a neutral tone, only about half of the syllables that have neutral tones in Standard Mandarin carry a neutral tone in Taiwan Mandarin\. For example, the experiential aspect suffix 過 and directional complements such as 上 ‘up’ and 來 ‘come’ are often pronounced with their canonical tones \(guo4, shang4, and lai2\) rather than as reduced syllables\.

As for words with reduplicated syllables, only kinship terms are produced with a neutral tone in Taiwan Mandarin\. However, to express endearment, the neutral tones of kinship terms can be realized as rising or high tones, as in 妈妈 ma3ma2 \(‘mother’\) and 姐姐 jie3jie1 \(‘older sister’\)\. For verbal and nominal reduplication, the tone of the initial syllable is often maintained: 看看 \(‘take a look at’\) is realized as kan4kan4 rather than as kan4kan5 \[[21](https://arxiv.org/html/2606.26360#bib.bib31)\]\.

### 2\.3 The semantics of the floating tone

As mentioned above, several studies have reported that in conversational Taiwan Mandarin, words have not only a pitch component that reflects the canonical tones, but also a pitch component that most likely reflects words’ semantics \[[11](https://arxiv.org/html/2606.26360#bib.bib14),[25](https://arxiv.org/html/2606.26360#bib.bib13),[38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. The statistical tool that led to the discovery of word\-specific tonal signatures is the generalized additive model \[[51](https://arxiv.org/html/2606.26360#bib.bib7)\]\. Figure [1](https://arxiv.org/html/2606.26360#S2.F1) illustrates the component reflecting words’ tone pattern for a toy example with 5 words with the T4\-T5 tone pattern\. The left panel presents words’ individual pitch contours in normalized time\. The right panel shows the pitch component that the GAM reconstructs as being shared by all five words\. Individual words deviate from this shared pitch component\. When pitch data are available for many word tokens of the same word type, it becomes possible not only to isolate the pitch component shared by all tokens of all word types with a given tone pattern, but also to isolate, for each word type, the characteristic tone component of that word type\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x1.png)

Figure 1: Toy dataset\. The left panel shows the F0 contours of five selected tokens with T4\-T5 tone pattern, produced by the same speaker, from the corpus of spoken Beijing Mandarin\. The right panel shows the fitted F0 contours predicted by a simple GAM, using a thin plate regression spline smooth for normalized time as predictor\.

It is of course essential to take into account the many different prosodic factors that have been shown to modulate the phonetic realization of pitch\. Figure [2](https://arxiv.org/html/2606.26360#S2.F2) presents the pitch components of the tone patterns with a final floating tone, isolated by a GAM for the dataset of conversational Taiwan Mandarin studied by \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. These pitch components are modulated by the preceding and following tones \(3…4, 4…0, 4…1, 4…4\), represented by shades of blue\. The average of these components is shown in red\. The overall trend appears to be a downward sloping pitch contour that is rather similar irrespective of the lexical tone of the first syllable\. The GAM of \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\], and the GAMs to be presented below, take into account many more prosodic factors \(duration, speaker, gender, position in the utterance, bigram probability\), in order to optimize estimates of the pitch components tied to tone patterns and the pitch signatures tied to individual word types\.

In the study of\[[38](https://arxiv.org/html/2606.26360#bib.bib11)\], tokens and types of words with a neutral tone on the second syllable were less frequent than the types and tokens of words with tone patterns consisting of only lexical tones\. We therefore carried out a follow\-up study for Taiwan Mandarin, which we complemented with a replication study for Beijing Mandarin\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x2.png)

Figure 2: The average predicted pitch contours for the T1\-T5, T2\-T5, T3\-T5, and T4\-T5 tone patterns\. The contours are reproduced from the black curves in Figure 5 of \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. Note\. \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\] labels the neutral tone as T0, whereas the present study uses T5\. Accordingly, the original tone patterns T1\-T0, T2\-T0, T3\-T0, and T4\-T0 are presented here as T1\-T5, T2\-T5, T3\-T5, and T4\-T5\.

## 3 Materials

The data analyzed in the current study are drawn from spontaneous speech corpora representing two varieties of Mandarin: Beijing Mandarin and Taiwan Mandarin\. This section provides a detailed description of these corpora, acoustic processing, and data sampling\.

### 3\.1 Corpora

The Beijing Mandarin data were drawn from a corpus of conversational speech \[[44](https://arxiv.org/html/2606.26360#bib.bib51)\], consisting of recordings from 50 speakers \(25 female, 25 male\)\. All recordings were conducted in a soundproof recording studio in Beijing using high\-quality audio equipment\.

The Taiwan Mandarin data were extracted from the Taiwan Mandarin Spontaneous Speech Corpus\[[15](https://arxiv.org/html/2606.26360#bib.bib5)\], which contains approximately 30 hours of free and unstructured interview speech\. This corpus includes 55 native speakers of Taiwan Mandarin \(31 female, 24 male\) aged between 20 and 60 years\. Recordings were collected in Taipei over several years between 2000 and 2010 using high\-quality microphones and digital recorders in quiet locations, preferably in soundproof laboratories\. Before the interviews, participants were informed that the study aimed to investigate their views on various aspects of life\. Interviewers minimized active participation in order to elicit extended monologues, covering topics such as childhood, education, work, personal relationships, and other domains of life\. After the interview, participants were asked for consent for the use of their recordings in research\.

Both corpora followed comparable elicitation strategies designed to encourage long stretches of spontaneous speech while minimizing interviewer intervention\. As a result, the recordings consist of naturally occurring speech covering broadly comparable thematic domains\.

To further assess comparability in discourse content, we made use of topic modeling, an unsupervised machine learning technique for identifying latent thematic structure\. We used the BERTopic framework \[[18](https://arxiv.org/html/2606.26360#bib.bib29)\] to model topics in the corpora\. Prior to modeling, the data were preprocessed through tokenization and the removal of stopwords, using corpus\-specific stopword lists for Beijing Mandarin \(Simplified Chinese\) and Taiwan Mandarin \(Traditional Chinese\)\. Topic interpretation was based on the highest\-weight \(i\.e\., most representative\) words within each topic\. Results indicate that both corpora cover a wide range of social, cultural, technological, and personal\-experience\-related topics, although the specific thematic emphases may differ\. In the Beijing Mandarin corpus, prominent topics include national and regional topics, mobile technology and digital consumption, travel and sports, urban public transportation infrastructure, everyday mobility, and emerging technologies\. The Taiwan Mandarin corpus features frequently discussed topics such as health and medicine, marriage and family, music, education and extracurricular activities, natural disasters and environmental risk, and leisure activities such as golf\.

### 3\.2 Transcription and selection criteria

The audio recordings of Beijing Mandarin were transcribed and annotated using simplified Chinese characters\. Forced alignment between the audio and transcriptions was carried out using the Montreal Forced Aligner \(MFA\) \[[41](https://arxiv.org/html/2606.26360#bib.bib48)\]\. Pauses, which commonly occur in spontaneous speech, were identified and labeled as empty intervals\. The resulting automatic segmentation was subsequently manually checked by native speakers\. Fewer than 5% of tokens showed incorrect or imprecise alignments; these were corrected, following established practices in previous studies \[[46](https://arxiv.org/html/2606.26360#bib.bib53)\]\.

The Taiwan Mandarin recordings were transcribed using traditional Chinese characters\. Word boundaries in the orthographic representation were identified using a word segmentation system \[[40](https://arxiv.org/html/2606.26360#bib.bib47)\], and canonical tone information was assigned based on dictionaries\. The character transcriptions were then romanized to enable forced alignment with Easyalign \[[17](https://arxiv.org/html/2606.26360#bib.bib19)\] at the character and word levels\. Transcriptions were aligned at the word and syllable levels with the audio\. The resulting alignment was manually checked and, where necessary, corrected\.

Since Beijing and Taiwan Mandarin differ with respect to which words carry floating tones \(e\.g\., 厉害, ‘severe; impressive’, is listed as li4hai5 \(T4–T5\) in the Beijing standard, but as li4hai4 \(T4–T4\) in the Taiwan standard\), for our cross\-dialect comparison, we decided to select words based on the tones as indicated in Xiandai Hanyu Cidian \(现代汉语词典\) \[the Contemporary Chinese Dictionary \(7th edition\)\] \[[9](https://arxiv.org/html/2606.26360#bib.bib25)\], as this maximizes the opportunity to find words that are, at least in one of the dialects, realized with a floating tone\. We also extracted the normative tones for Taiwan Mandarin, using the Ministry of Education Dictionary \(教育部国语辞典简编本\), a widely used and regularly updated reference for contemporary, frequently used vocabulary in Taiwan Mandarin\. Our analysis focused on disyllabic words with a lexical tone \(T1, T2, T3, or T4\) on the first syllable and a neutral tone \(T5\) on the second syllable, which corresponds to the T1\-T5, T2\-T5, T3\-T5, and T4\-T5 tone patterns\. All selected words in the Beijing dataset conform to this pattern according to the Beijing standard\. However, some of these words are listed with a full lexical tone on the second syllable in Taiwan Mandarin dictionaries\. Importantly, to preserve cross\-variety comparability, such items were retained in the Taiwan dataset but grouped into a separate category labeled “Others”\. This category comprises those words that are classified as neutral\-tone items in Beijing Mandarin but as bearing a full lexical tone in Taiwan Mandarin\.
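The categorization described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used for the datasets: tones are represented as hypothetical tuples of integers per syllable, the two function names are invented, and it is assumed that a Taiwan item keeps its tone-pattern label whenever the Taiwan dictionary also lists T5 on the second syllable.

```python
LEXICAL_TONES = {1, 2, 3, 4}
NEUTRAL_TONE = 5


def neutral_pattern(tones):
    """Return 'T1-T5' ... 'T4-T5' for a lexical + neutral disyllable, else None."""
    if len(tones) == 2 and tones[0] in LEXICAL_TONES and tones[1] == NEUTRAL_TONE:
        return f"T{tones[0]}-T{NEUTRAL_TONE}"
    return None


def taiwan_category(beijing_tones, taiwan_tones):
    """Category of a word in the Taiwan dataset.

    Words are selected on the basis of the Beijing standard; items that the
    Taiwan dictionary lists with a full lexical tone on the second syllable
    are grouped as 'Others'. Returns None for words that are not selected.
    """
    if neutral_pattern(beijing_tones) is None:
        return None
    pattern = neutral_pattern(taiwan_tones)
    return pattern if pattern is not None else "Others"


# 厉害: li4hai5 in the Beijing standard, li4hai4 in the Taiwan standard.
category = taiwan_category((4, 5), (4, 4))
```

For 厉害, the example given in the text, this yields the label "Others".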

For each of the two corpora, we selected all tokens of two\-syllable word types with a neutral tone \(according to the authoritative dictionaries of their respective dialects\)\. Following, e\.g\., \[[11](https://arxiv.org/html/2606.26360#bib.bib14),[37](https://arxiv.org/html/2606.26360#bib.bib12),[38](https://arxiv.org/html/2606.26360#bib.bib11)\], we selected only those word types for which at least five tokens were available in the corpus, to ensure sufficient representation for statistical modeling\. Furthermore, in order to prevent model predictions from being dominated by a few highly frequent words, the maximum number of tokens per word type was capped at 200\. In addition, we imposed the requirement that each word type was produced by at least one female and one male speaker to avoid gender\- and speaker\-based imbalance\.
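The three selection criteria can be expressed compactly in code. The sketch below is an illustration under assumptions that the text does not specify: how tokens were chosen when a word type exceeded the cap (random sampling with a fixed seed is assumed here), and the order in which the criteria were applied (the gender requirement is checked before capping). The token representation and the function name are hypothetical.

```python
import random
from collections import Counter, defaultdict

MIN_TOKENS = 5    # minimum number of tokens per word type
MAX_TOKENS = 200  # cap on the number of tokens per word type


def select_tokens(tokens, seed=0):
    """Apply the frequency and gender criteria to a list of token dicts.

    Each token is a dict with the (hypothetical) keys 'word' and 'gender',
    where 'gender' is 'F' or 'M'.
    """
    rng = random.Random(seed)
    by_word = defaultdict(list)
    for tok in tokens:
        by_word[tok["word"]].append(tok)

    selected = []
    for toks in by_word.values():
        genders = {tok["gender"] for tok in toks}
        # Require at least five tokens and at least one female and one male speaker.
        if len(toks) < MIN_TOKENS or not {"F", "M"} <= genders:
            continue
        # Assumption: over-represented types are randomly downsampled to the cap.
        if len(toks) > MAX_TOKENS:
            toks = rng.sample(toks, MAX_TOKENS)
        selected.extend(toks)
    return selected


# Invented corpus: four hypothetical word types with different profiles.
corpus = (
    [{"word": "w_rare", "gender": "F"}] * 2 + [{"word": "w_rare", "gender": "M"}] * 2
    + [{"word": "w_one_gender", "gender": "F"}] * 6
    + [{"word": "w_frequent", "gender": "F"}] * 150 + [{"word": "w_frequent", "gender": "M"}] * 100
    + [{"word": "w_ok", "gender": "F"}] * 3 + [{"word": "w_ok", "gender": "M"}] * 2
)
counts = Counter(tok["word"] for tok in select_tokens(corpus))
```

In the invented corpus, the rare type and the single-gender type are dropped, the frequent type is reduced to 200 tokens, and the remaining type is kept in full.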

As shown in Table [1](https://arxiv.org/html/2606.26360#S3.T1), the resulting Beijing Mandarin dataset contains 4,871 tokens representing 101 word types across four tone patterns \(T1\-T5, T2\-T5, T3\-T5, and T4\-T5\)\. The Taiwan Mandarin dataset contains 3,831 tokens representing 80 word types, across five categories \(T1\-T5, T2\-T5, T3\-T5, T4\-T5, and Others\)\. In general, the mean number of tokens per type is comparable between the two datasets\.

Table 1: Summary of tokens and word types in each tone pattern, including the shared subset for Beijing and Taiwan Mandarin\. The “Others” category contains 26 word types \(see Table [2](https://arxiv.org/html/2606.26360#S3.T2)\)\. Although many of these words are realized as T5 in Beijing Mandarin, they are classified as Others here because the Taiwan Mandarin Dictionary specifies a full lexical tone rather than Tone 5 on the second syllable\.

Table 2: The 26 word types in the Others category of the Taiwan Mandarin dataset\. These are disyllabic words whose second syllable does not carry T5\. Pinyin follows the Taiwan Mandarin Dictionary\.

### 3\.3 Acoustic processing

F0 extraction was carried out using Praat \[[2](https://arxiv.org/html/2606.26360#bib.bib4)\], implemented via the Parselmouth Python interface\. For each token, F0 was measured across the entire syllable, including onset consonants, vowels, and final nasals, if present\. The pitch floor and ceiling were set to 75–400 Hz for female speakers and 50–300 Hz for male speakers\. No smoothing or interpolation was applied in order to preserve the original acoustic speech signal\. F0 values were not computed in cases where voicing was absent \(e\.g\., voiceless consonants\) or where creaky voice prevented reliable estimation; these values were coded as missing \(NA\)\. Abrupt pitch jumps between consecutive time points, probably due to F0 tracking errors, were excluded\. Tokens with potential F0 tracking errors or creaky voice were identified through visual inspection by native speakers and excluded from further analysis\. Thus, despite differences in the alignment tools used for the two corpora, extensive manual inspection of the tokens selected for analysis with respect to alignment accuracy ensured comparability across the two datasets\.
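The exclusion of abrupt pitch jumps can be illustrated with a small filter over an already extracted F0 track\. This is only a sketch under stated assumptions: the function name clean\_f0 and the threshold of 6 semitones are hypothetical \(no threshold is reported in the text\), and the actual screening also involved visual inspection, which code cannot reproduce\. Missing values are represented as None\.

```python
import math

# Pitch floor and ceiling per speaker sex, as reported in the text (Hz).
PITCH_RANGE = {"F": (75.0, 400.0), "M": (50.0, 300.0)}

def clean_f0(track, sex, max_jump_semitones=6.0):
    """Return a copy of `track` with implausible F0 values set to None.

    track: list of F0 values in Hz, with None for frames without an estimate.
    max_jump_semitones: hypothetical threshold; the text does not report one.
    """
    floor, ceiling = PITCH_RANGE[sex]
    cleaned = []
    previous = None  # last accepted value within the current voiced stretch
    for f0 in track:
        if f0 is None or not (floor <= f0 <= ceiling):
            cleaned.append(None)
            previous = None  # a gap ends the voiced stretch
            continue
        if previous is not None:
            jump = abs(12.0 * math.log2(f0 / previous))
            if jump > max_jump_semitones:
                cleaned.append(None)  # treat as a tracking error
                continue
        cleaned.append(f0)
        previous = f0
    return cleaned
```

Comparing each frame with the last accepted value, rather than with the immediately preceding frame, keeps a run of octave errors from being accepted after its first frame has been removed\.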

### 3\.4 Predictors

For statistical analysis, we collected the following predictors:

normalized time: For each word token, time was normalized between 0 and 1, allowing tokens with different durations to be modeled on a common time scale\.

tone pattern: The tonal pattern of a two\-syllable word, i\.e\., the combination of the lexical tone of the first syllable and the neutral tone of the second syllable\. In the Beijing Mandarin dataset, this factor has four levels: T1\-T5, T2\-T5, T3\-T5, and T4\-T5\. In the Taiwan Mandarin dataset, it has five levels: T1\-T5, T2\-T5, T3\-T5, T4\-T5, and Others\.

word: The orthography of the word, which is a factor with 101 levels in the Beijing dataset and a factor with 80 levels in the Taiwan Mandarin dataset\.

speaker: A unique speaker identifier, a factor with 50 levels for the Beijing dataset and with 55 levels for the Taiwan Mandarin dataset\.

duration: A continuous variable indicating the duration of a token, measured in seconds\.

word position in utterance: The normalized position of a word within its utterance\. This is calculated by dividing the word’s position by the total number of syllables in the utterance, yielding a value between 0 and 1\. Lower values indicate that the token occurs closer to the beginning of the utterance, and higher values indicate that it occurs closer to the end of the utterance\. For single\-word utterances, the position is coded as 1\. An utterance is defined as a continuous stretch of speech bounded by silent intervals in the corpus segmentation\.

tonal context: A factor with 36 levels, comprising all combinations of the tone on the syllable preceding the word and the tone on the syllable following the word\. If a pause occurs immediately before or after a word, the corresponding tone is coded as PAUSE\. Speakers occasionally engage in code\-switching, typically between English and Chinese\. For tokens immediately preceded or followed by an English word, no tone could be reliably identified; these tokens were therefore labeled as NA and excluded from further analyses\. With five tones \(T1, T2, T3, T4, and T5\) plus PAUSE, and two positions \(preceding and following\), the number of levels of tonal context is $6 \times 6 = 36$\.

bigram probability of the previous word: Bigram probability quantifies how predictable a word is in its context\. This measure of contextual predictability is based on the relative frequency of the word’s co\-occurrence with surrounding words\. A higher bigram probability indicates that the target word is more predictable within its given context\. In the present study, bg\_prob\_prev is calculated as the probability of the occurrence of the target word given the preceding word\.

bigram probability of the following word: This measure, bg\_prob\_fol, is calculated as the probability of the occurrence of the target word given the following word\.
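To make the definitions of tonal context and the two bigram probabilities concrete, the following sketch enumerates the 36 tonal context levels and estimates both probabilities by relative frequency\. It is an illustration under assumptions: the label format \(e\.g\., T1\_T5\) and the function name are hypothetical, and the text does not specify the corpus from which the bigram counts were taken, so the counts here come from whatever list of utterances is passed in\.

```python
from collections import Counter
from itertools import product

# Five tones plus PAUSE, for the preceding and the following position.
TONES = ["T1", "T2", "T3", "T4", "T5", "PAUSE"]
TONAL_CONTEXTS = [f"{prev}_{fol}" for prev, fol in product(TONES, TONES)]

def bigram_probabilities(utterances):
    """Return two functions estimating bigram probabilities by relative frequency.

    utterances: list of utterances, each a list of word strings.
    """
    pair_counts = Counter()
    first_counts = Counter()   # how often a word opens a bigram
    second_counts = Counter()  # how often a word closes a bigram
    for words in utterances:
        for w1, w2 in zip(words, words[1:]):
            pair_counts[(w1, w2)] += 1
            first_counts[w1] += 1
            second_counts[w2] += 1

    def bg_prob_prev(previous, target):
        # P(target | preceding word)
        if first_counts[previous] == 0:
            return 0.0
        return pair_counts[(previous, target)] / first_counts[previous]

    def bg_prob_fol(target, following):
        # P(target | following word)
        if second_counts[following] == 0:
            return 0.0
        return pair_counts[(target, following)] / second_counts[following]

    return bg_prob_prev, bg_prob_fol
```

For example, if 去 is followed by 看看 in two of its three occurrences as the first member of a bigram, bg\_prob\_prev for 看看 given 去 is 2/3\.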

## 4 Analysis

We made use of Generalized Additive Models \(GAMs\) \[[51](https://arxiv.org/html/2606.26360#bib.bib7), [12](https://arxiv.org/html/2606.26360#bib.bib10)\] to decompose pitch contours over normalized time into a series of component contours, using the implementation of the mgcv package \[[51](https://arxiv.org/html/2606.26360#bib.bib7)\] for R \[[47](https://arxiv.org/html/2606.26360#bib.bib1)\]\. We implemented an AR\(1\) process \(first\-order autoregressive model\) in the residuals to take into account that there is strong autocorrelation in time series of pitch measurements\. The inclusion of the AR\(1\) process with an autocorrelation coefficient of $\rho = 0.90$ substantially reduced residual autocorrelation\.
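The idea behind the AR\(1\) correction can be illustrated without fitting a GAM\. Under an AR\(1\) process, each residual equals $\rho$ times the previous residual plus independent noise, so subtracting $\rho$ times the previous residual should leave a series with little autocorrelation\. The sketch below simulates such a series and checks this; it is a toy illustration with hypothetical function names, not the estimation procedure used inside mgcv\.

```python
import random

def simulate_ar1(n, rho, seed=1):
    """Simulate n values of an AR(1) process with standard normal innovations."""
    rng = random.Random(seed)
    series = [rng.gauss(0.0, 1.0)]
    for _ in range(n - 1):
        series.append(rho * series[-1] + rng.gauss(0.0, 1.0))
    return series

def lag1_autocorrelation(values):
    """Sample autocorrelation at lag 1."""
    n = len(values)
    mean = sum(values) / n
    numerator = sum((values[t] - mean) * (values[t - 1] - mean) for t in range(1, n))
    denominator = sum((v - mean) ** 2 for v in values)
    return numerator / denominator

def ar1_whiten(residuals, rho=0.90):
    """Remove the AR(1) component: e_t - rho * e_{t-1} for t >= 1."""
    whitened = [residuals[0]]
    for t in range(1, len(residuals)):
        whitened.append(residuals[t] - rho * residuals[t - 1])
    return whitened

raw = simulate_ar1(5000, 0.90)
print(lag1_autocorrelation(raw))              # close to 0.9
print(lag1_autocorrelation(ar1_whiten(raw)))  # close to 0
```

In the real analysis, each token is its own time series, so the correction has to restart at the first measurement of every token rather than run across token boundaries\.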

The response variable was the natural logarithm of F0\. The log transformation resulted in a distribution of F0 values that roughly followed a normal distribution; we did not use MEL or BARK scales, as our interest is primarily in the production of pitch contours, rather than the comprehension of pitch contours\. Pitch contours were modeled over entire words, including onset consonants, vowels, and final nasals, if present \(for a detailed discussion of how GAMs reconstruct pitch contours as desired across voiceless segments, see \[[11](https://arxiv.org/html/2606.26360#bib.bib14)\]\)\.

We analyzed the Beijing and Taiwan datasets separately\. For each variety, we started with a simple baseline model that included a smooth for normalized time, as well as by\-speaker random nonlinear components, using a factor smooth with shrinkage\. These by\-speaker components account for differences in how speakers, on average, realize pitch contours\.

The baseline GAM was specified as follows:

$$
\begin{aligned}
\texttt{logpitch} \sim{} & \texttt{s(normalized\_t, k = 5)} \\
& + \texttt{s(normalized\_t, speaker, bs = 'fs', m = 1)}
\end{aligned}
\tag{1}
$$

We then fitted 7 additional GAMs, each with one additional predictor \(word, tonal context, tone pattern, word position, duration, bigram probability previous, and bigram probability following\)\. For word, tonal context and tone pattern, we implemented factor smooth interactions with normalized time\. For the covariates, we implemented tensor product smooths to model their interactions with normalized time\. As many of these predictors are correlated, these simple models make it straightforward to assess the relative variable importance of the individual predictors\. We assessed variable importance by calculating the extent to which model fit improved using Akaike’s Information Criterion \(AIC\): greater reductions in AIC imply improved model fit\.
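The variable importance measure amounts to subtracting the AIC of each single\-predictor model from the AIC of the baseline model and ranking the differences\. A minimal sketch, with hypothetical function names and purely illustrative numbers in the usage note \(they are not values from the paper\):

```python
def aic(log_likelihood, edf):
    """Akaike's Information Criterion: 2 * edf - 2 * log-likelihood.

    For a GAM, edf is the effective number of degrees of freedom of the model.
    """
    return 2.0 * edf - 2.0 * log_likelihood

def rank_by_aic_reduction(baseline_aic, candidate_aics):
    """Rank predictors by how much adding them lowers AIC relative to the baseline.

    candidate_aics: dict mapping a predictor name to the AIC of the model
    that adds only that predictor to the baseline.
    """
    reductions = {name: baseline_aic - value for name, value in candidate_aics.items()}
    return sorted(reductions.items(), key=lambda item: item[1], reverse=True)
```

With a toy baseline AIC of 1000 and toy candidate AICs of 400 \(word\), 600 \(tonal context\) and 900 \(duration\), the ranking would be word, tonal context, duration, with reductions of 600, 400 and 100 units\.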

Figure [3](https://arxiv.org/html/2606.26360#S4.F3) presents the reductions in AIC obtained for each of the predictors, for Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\. In Beijing Mandarin, word yielded a substantially greater improvement in model fit than any other predictor, with an AIC reduction of 20874 units\. Tonal context ranked second, followed by tone pattern and word position\. In contrast, for Taiwan Mandarin, tonal context emerged as the most important predictor, followed by word position and word\. Thus, word is the most important predictor for the Beijing dataset, whereas tonal context is the most important predictor for the Taiwan dataset\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x3.png)Figure 3: Reduction in AIC when a predictor of interest is added to the baseline GAM\. Larger reductions indicate greater variable importance\. Results are shown separately for Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\.

We then fit GAMs including all predictors, using the following model specification:

$$
\begin{aligned}
\texttt{logpitch} \sim{} & \texttt{tone\_pattern} \\
& + \texttt{s(normalized\_t, by = tone\_pattern, k = 5)} \\
& + \texttt{s(normalized\_t, speaker, bs = 'fs', m = 1)} \\
& + \texttt{s(normalized\_t, word, bs = 'fs', m = 1)} \\
& + \texttt{s(normalized\_t, tonal\_context, bs = 'fs', m = 1)} \\
& + \texttt{s(norm\_utt\_pos, k = 5)} \\
& + \texttt{ti(normalized\_t, norm\_utt\_pos, k = c(4,4))} \\
& + \texttt{s(logdur, k = 8)} \\
& + \texttt{ti(normalized\_t, logdur, k = c(5,5))} \\
& + \texttt{s(bg\_prob\_prev, k = 5)} \\
& + \texttt{ti(normalized\_t, bg\_prob\_prev, k = c(5,5))} \\
& + \texttt{s(bg\_prob\_fol, k = 5)} \\
& + \texttt{ti(normalized\_t, bg\_prob\_fol, k = c(5,5))}
\end{aligned}
\tag{2}
$$

This model improved substantially on the single predictor GAMs \(decrease in AIC compared to the best fitting single predictor models: 32704\.17 units for the Beijing dataset and 28593\.11 units for the Taiwan dataset\)\. This complex model, which brings a wide range of prosodic factors under statistical control, has one downside\. Due to the correlations between predictors, there is some concurvity in the model\. High concurvity scores indicate that the partial effect of one predictor is predictable from other predictors in the model, rendering model interpretation more difficult\. Figure [4](https://arxiv.org/html/2606.26360#S4.F4) presents the estimated concurvity scores for Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\. The concurvity scores are highest for four control variables for prosody: word position, duration, previous bigram probability, and following bigram probability\. Unsurprisingly, concurvity scores are the lowest for speaker, followed by tonal context\. Concurvity scores are also relatively low for word\. The low concurvity scores for word and tonal context indicate that the partial effects of these predictors are interpretable, as part of a larger model that controls for prosodic factors\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x4.png)

Figure 4: Estimated concurvity scores for smooth terms in the best\-fit GAMs for Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\. Lower concurvity values indicate that a predictor is contributing more independently to the model fit\. The concurvity scores for the word smooth are highlighted in red\. Note\. For models with continuous predictors, only the concurvities of smooth terms involving normalized time are shown\. The relatively high concurvity values observed for tone\-pattern smooths are unsurprising given the substantial similarity among the pitch trajectories associated with the four tone patterns \(cf\. Figure [6](https://arxiv.org/html/2606.26360#S4.F6)\)\.

Tonal context is a strong co\-determinant of pitch contours in Taiwan Mandarin\. Figure [5](https://arxiv.org/html/2606.26360#S4.F5) illustrates its effect on the predicted contours while holding the tone pattern constant\. The left and right panels show the 36 levels of tonal context for Beijing Mandarin and Taiwan Mandarin, respectively\. Interestingly, when a neutral tone is followed by another neutral tone \(highlighted in red and blue, respectively\), the predicted F0 at the end of the syllable is lower than in other tonal contexts, especially for Beijing Mandarin\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x5.png)Figure 5: The effect of tonal context on predicted pitch contours, showing all 36 tonal context levels in Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\. The predicted pitch contours are estimated from the partial effect of the factor smooth for tonal context\. In both panels, contexts in which the neutral\-tone syllable is followed by another neutral tone are highlighted by color\. They represent the tonal sequences T1–X–T5–T5, T2–X–T5–T5, T3–X–T5–T5, T4–X–T5–T5, and PAUSE–X–T5–T5, in which “X–T5” represents an average over disyllabic words across the four tone patterns\. All other tonal contexts are shown in grey\.

In what follows, we first examine the partial effect of tone pattern, and then consider the partial effect of word\.

### 4\.1 Tone pattern

Figure [6](https://arxiv.org/html/2606.26360#S4.F6) represents the predicted pitch contours for tone patterns \(T1\-T5, T2\-T5, T3\-T5, T4\-T5\) in Beijing \(left panel\) and Taiwan Mandarin \(right panel\)\. The first half of each contour before the average syllable boundary, corresponding to the normalized time interval \[0, 0\.49\], represents the F0 contour of the lexical tone, while the second half \[0\.49, 1\] represents that of the neutral tone syllable\. The predicted pitch contours show narrower confidence intervals and higher pitch height in Beijing Mandarin than in Taiwan Mandarin\. The narrower confidence intervals for Beijing Mandarin dovetail well with the observation that tone pattern ranks as the third most important predictor in Beijing Mandarin, but as the least important in Taiwan Mandarin in the variable importance analysis \(see Figure [3](https://arxiv.org/html/2606.26360#S4.F3)\)\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x6.png)Figure 6: Predicted pitch contours for tone patterns in Beijing \(left panel\) and Taiwan \(right panel\)\. The pitch contours shown represent the partial effect of tone pattern, in combination with the corresponding tone\-pattern specific intercepts\. Separate GAMs were fitted to the two dialects\. The vertical dashed line indicates the average syllable boundary on the normalized time scale, with the first half corresponding to the initial lexical tone, and the second half corresponding to the neutral\-tone syllable\.

In Taiwan Mandarin, the first halves of the F0 contours largely resemble the prototypical citation forms of the respective lexical tones\. T1 is high and level throughout the syllable, remaining higher than T2 and T3\. T2 might be realized with a rising contour, but as the confidence interval is very wide, the T2 of the Taiwan T2\-T5 pattern might just as well be described as a mid level tone\. T3 displays a falling trend, with the rise beginning around the average syllable boundary\. T4 closely resembles its citation form with a clear fall\. Overall, lexical tones in the stressed position of two\-syllable words largely maintain their canonical shapes\.

In Beijing Mandarin, the realization of lexical tones shows greater deviation from citation forms\. T1 and T4 have an initially rising F0 in Beijing Mandarin\. T2 exhibits a small initial fall followed by a modest rise that begins around 70% into the rhyme of the first syllable and continues in the second syllable\. T3 is realized with a strong initial fall that continues slightly into the second syllable\.

Figure [7](https://arxiv.org/html/2606.26360#S4.F7) shows the estimated difference curves for pairs of tone patterns in Beijing Mandarin \(left\) and Taiwan Mandarin \(right\)\. These difference curves were estimated with the plot\_diff function from the itsadug package in R\. The red areas indicate the normalized time intervals where the estimated difference curves are significantly different from zero, whereas the grey areas mark the time intervals where no significant difference is observed\. The difference curve for T1\-T5 and T2\-T5 in Beijing Mandarin, as well as the difference curve for T3–T5 and T4–T5, includes zero at the endpoints of the pitch contours, indicating no significant differences in F0 at word offsets\. All other pairs show significant differences in F0 at word offset\. In Taiwan Mandarin, the confidence intervals of the estimated F0 differences approach zero toward the end of the contour across all comparisons\. Consistent with earlier observations \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\], this indicates that in Taiwan Mandarin, the neutral tone has a final mid\-to\-low target F0\.
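The logic of reading such difference curves can be sketched as follows: at each normalized time point, the difference between two predicted contours counts as significant when its confidence interval excludes zero, and consecutive significant time points are merged into intervals\. The sketch assumes pointwise 95% intervals computed from a standard error at each time point; the function name is hypothetical and this is not the internal code of plot\_diff\.

```python
def significant_intervals(times, differences, standard_errors, z=1.96):
    """Return (start, end) time intervals where the confidence interval excludes zero.

    times, differences and standard_errors are equally long lists; z = 1.96
    gives pointwise 95% intervals (an assumption for this illustration).
    """
    intervals = []
    start = None
    previous_time = None
    for t, diff, se in zip(times, differences, standard_errors):
        lower, upper = diff - z * se, diff + z * se
        excludes_zero = lower > 0 or upper < 0
        if excludes_zero and start is None:
            start = t  # a significant stretch begins
        elif not excludes_zero and start is not None:
            intervals.append((start, previous_time))  # the stretch ended
            start = None
        previous_time = t
    if start is not None:
        intervals.append((start, previous_time))
    return intervals
```

A difference curve whose confidence interval includes zero at the final time point, as reported for T1\-T5 versus T2\-T5 in Beijing Mandarin, would yield no interval that extends to the word offset\.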

![Refer to caption](https://arxiv.org/html/2606.26360v1/x7.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.26360v1/x8.png)\(b\)

Figure 7: Difference curves for pairs of tone patterns in Beijing Mandarin \(left panel\) and Taiwan Mandarin \(right panel\)\. Red areas indicate the normalized time intervals where the estimated difference curve is significantly different from zero, whereas grey areas mark the intervals with no significant difference\.

The predicted contours of the neutral tone display systematic variation across tonal contexts in both Beijing and Taiwan Mandarin\. In both varieties, the F0 of the neutral tone generally approaches a final medium\-to\-low pitch target\. This final target appears to be somewhat lower for Taiwan Mandarin, but the wide confidence intervals argue for caution\. In Taiwan Mandarin, unlike in Beijing Mandarin, all four tone patterns converge to the same word\-final target, as shown by the difference curve analysis \(see Figure [7](https://arxiv.org/html/2606.26360#S4.F7)\)\.

In the light of these findings, we can answer the first two research questions as follows\. With respect to the nature of the neutral tone, it is clear that lexical tones followed by a neutral tone have their own specific tone patterns that cannot be reduced to some general tone sandhi rule\. Some tone patterns can be interpreted as drawn\-out versions of the initial lexical tone\. For Beijing Mandarin, the T2\-T5 tone pattern could be interpreted as a rising tone with a late start of the rise\. Likewise, the T3\-T5 sequence in this variety could be seen as a dipping tone that is realized over both syllables\. For Taiwan Mandarin, the T3\-T5 sequence can likewise be described as a dipping tone drawn out over both syllables\. For Taiwan Mandarin, furthermore, the T4\-T5 pattern can also be seen as a falling tone, drawn out over both syllables, with a late start of the fall\. But other tone patterns have their own specific realizations, such as T1\-T5 and T4\-T5 in Beijing Mandarin, and T2\-T5 in Taiwan Mandarin\. The only property shared by all T5 words is the mid\-to\-low final target\. We take this property to be the defining characteristic of the T5 words, just as a final fall is a characteristic of two\-syllable words with a T4 on the second syllable \[cf\. [38](https://arxiv.org/html/2606.26360#bib.bib11)\]\.

With respect to the question of whether there are differences in the realization of tone patterns between Beijing and Taiwan Mandarin, it is clear that marked differences indeed exist\. The initial rise found for T1\-T5 and T4\-T5 words sets Beijing Mandarin apart from Taiwan Mandarin\. The nearly level realization of T2\-T5 sets Taiwan Mandarin apart from Beijing Mandarin\. The tone pattern that is most similar across the two varieties is T3\-T5\. Furthermore, there is more variability in the realization of tone patterns in Taiwan Mandarin, and in this variety, the word\-final tone target is the same for all four tone patterns\.

### 4\.2 Word

We next consider the question of whether words with a T5 on the second syllable have their own specific pitch signatures\. The GAM analyses indeed provide strong support for word\-specific tonal signatures not only for Taiwan Mandarin, but also for Beijing Mandarin\. In fact, the variable importance for word in Beijing Mandarin is considerably stronger than that for Taiwan Mandarin \(see Figure [3](https://arxiv.org/html/2606.26360#S4.F3) above\)\.

Figure [8](https://arxiv.org/html/2606.26360#S4.F8) presents the predicted pitch contours for those words that are attested in both the Beijing and Taiwan Mandarin datasets\. The upper panel displays 27 words whose second syllable bears a neutral tone in both varieties\. These words are organized by their morpho\-syntactic structure, including reduplicated kinship terms such as 妈妈 \(ma1ma5, ‘mom’\), reduplicated verbs such as 看看 \(kan4kan5, ‘have a look’\), plural suffix constructions \(e\.g\., X\+们 men5, plural marker\), nominal suffix constructions \(e\.g\., X\+子 zi5, nominal suffix\), structural particle constructions \(e\.g\., X\+的 de5, structural particle\), aspectual or other function\-word constructions \(e\.g\., 了 le5, aspect marker\), and lexicalized neutral\-tone words such as 东西 \(dong1xi5, ‘thing’\)\. The lower panel presents 16 words the second syllable of which, according to standard descriptions, is realized with a neutral tone in Beijing Mandarin but not in Taiwan Mandarin\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x9.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.26360v1/x10.png)\(b\)

Figure 8: Predicted pitch contours for individual words, estimated by combining the partial effect of the factor smooth for word, the partial effect of tone pattern, and the model intercepts\. Predictions are based on separate GAMs fitted to the Beijing and Taiwan Mandarin datasets and are overlaid to facilitate comparison\. Words that bear neutral tone in both varieties are shown in the upper panel\. Words that are realized with neutral tone in Beijing Mandarin but not in Taiwan Mandarin are shown in the lower panel\. In the lower panel, the pinyin transcription follows the Taiwan Mandarin standard\.

Word\-specific tone signatures are clearly visible when comparing words with the same tone pattern, such as 弟弟 \(panel a, 4\), 爸爸 \(panel a, 6\), 太太 \(panel a, 27\) and 记得 \(panel a, 21\)\. Focusing on Beijing Mandarin \(red curves\), for 弟弟 we observe a clearly falling pitch contour, for 爸爸 a nearly level pitch contour, for 太太 the second syllable shows a clear rise, whereas for 记得, the pitch contour is inverse U\-shaped\.

In the present dataset, there is one heterographic homophone: ta1men5 \(‘they’\), represented by 他们 \(‘they, male or mixed gender’\) and 她们 \(‘they, female’\)\. In Beijing Mandarin \(indicated by the red curve\), 他们 exhibits a relatively flat pitch contour, whereas 她们 shows a slightly rising contour\.

Turning to Taiwan Mandarin \(blue curves\), some words have realizations that clearly diverge from the dictionary standards\. For example, 早上 \(zao3shang5, ‘morning’, panel a, 25\) is realized with a fall\-rise\-fall contour that can be interpreted as the dipping tone of 早 followed by the falling tone of 上 \(as preposition or verb\) rather than the neutral tone prescribed by the dictionary\. The fall\-rise\-fall contour diverges markedly from the dipping contour of the T3\-T5 pattern in Taiwan Mandarin \(shown in Figure [6](https://arxiv.org/html/2606.26360#S4.F6)\)\. 爷爷 \(panel a, 5\) shows a final rise that is much stronger than expected given the T2\-T5 tone pattern, which is basically a mid level tone \(see Figure [6](https://arxiv.org/html/2606.26360#S4.F6)\)\. 你们 \(ni3men5, ‘you\.PL’, panel a, 9\) displays a dipping contour at the end of the F0 trajectory, instead of a dipping contour spread out across two syllables\. Possibly, the final rise is due to the rising tone of 们 \(men2, ‘door, opening’\)\. For the one homophone pair in our dataset, in Taiwan Mandarin, 她们 \(ta1men5, ‘they, female’\) is realized with a higher pitch than 他们 \(ta1men5, ‘they, male or mixed gender’\), consistent with patterns reported for the monosyllables 她 \(ta1, ‘she’\) and 他 \(ta1, ‘he’\) for this variety \[[25](https://arxiv.org/html/2606.26360#bib.bib13)\]\.

Words that are described as having a neutral tone in Beijing Mandarin and a lexical tone in Taiwan Mandarin also exhibit remarkable word\-specific signatures\. Consider, for instance, 休息 \(xiu1xi2, ‘rest’, panel b, 15\)\. In Beijing Mandarin, this word is pronounced with an F0 contour that is similar to the pitch component of the T1\-T5 tone pattern: an initial rise followed by a gentle fall\. For Taiwan Mandarin, instead of a T1\-T2 tone pattern, we find a rise\-fall pitch contour\. For 告诉 \(gao4su4, ‘tell’, panel b, 2\) and 厉害 \(li4hai4, ‘serious/awesome’, panel b, 3\), the expected tone pattern is a rise\-fall for Beijing Mandarin \(see Figure [6](https://arxiv.org/html/2606.26360#S4.F6), left panel\), but for these words, the word\-specific pitch signatures superimpose a dipping pattern \(告诉\) or a general downward pitch trend \(厉害\)\. By contrast, in Taiwan Mandarin, there is no clear evidence for a falling pitch contour on the second syllable\. For 告诉, the overall pattern is that of a dipping tone that ends at the mid\-to\-low final pitch that is characteristic for this dialect\. However, for 厉害, the second syllable carries a weak dipping tone, rather than a falling tone\.

Words with the most markedly different tonal realizations are 爷爷 \(panel a, 5\), 她们 \(panel a, 11\), 样子 \(panel a, 15\), 早上 \(panel a, 25\) and 太太 \(panel a, 27\)\. More extreme differences are seen for words that have been reported to have floating tones in Beijing Mandarin and lexical tones in Taiwan Mandarin: 漂亮 \(panel b, 6\), 朋友 \(panel b, 5\), 厉害 \(panel b, 3\), 喜欢 \(panel b, 13\) and 休息 \(panel b, 15\)\. But there are also words that exhibit remarkably similar contours across the two varieties\. Examples include 妈妈 \(ma1ma5, ‘mom’, panel a, 1\), 哥哥 \(ge1ge5, ‘brother’, panel a, 2\), and 我们 \(wo3men5, ‘we’, panel a, 8\)\. Even 便宜 \(pian2yi5, ‘cheap’, panel b, 1\), which dictionaries expect to differ across dialects, has fairly similar predicted tonal realizations\.

Overall, words bearing a neutral tone in both varieties exhibit more similar pitch contours than words that have a neutral tone only in Beijing Mandarin\. This difference emerges clearly from an inspection of the summed squared differences \(SSD\) between the predicted pitch contours of Beijing and Taiwan Mandarin for each word pair\. Figure [9](https://arxiv.org/html/2606.26360#S4.F9) shows the distribution of SSD values across the two groups\. The orange box represents words that bear a neutral tone \(T5\) in both Beijing and Taiwan Mandarin, while the green box represents words whose second syllable bears a neutral tone in Beijing Mandarin but not in Taiwan Mandarin\. The mean SSD for words bearing a neutral tone in both varieties was 0\.07, whereas the mean SSD for words bearing a neutral tone only in Beijing Mandarin was 0\.10 \($t = -2.2082$, $df = 23.668$, $p = 0.03717$\)\.
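The comparison can be sketched in two steps: an SSD between the two predicted contours of a word, and a two\-sample test on the SSD values of the two groups\. The fractional degrees of freedom reported above are what a Welch test with unequal variances produces, so the sketch uses Welch's formulas; that the authors used exactly this test is an assumption, and the function names are hypothetical\.

```python
import math

def ssd(contour_a, contour_b):
    """Summed squared difference between two equally long pitch contours."""
    return sum((a - b) ** 2 for a, b in zip(contour_a, contour_b))

def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom.

    x, y: lists of SSD values for the two groups of words.
    """
    nx, ny = len(x), len(y)
    mean_x, mean_y = sum(x) / nx, sum(y) / ny
    var_x = sum((v - mean_x) ** 2 for v in x) / (nx - 1)
    var_y = sum((v - mean_y) ** 2 for v in y) / (ny - 1)
    se_squared = var_x / nx + var_y / ny
    t = (mean_x - mean_y) / math.sqrt(se_squared)
    df = se_squared ** 2 / (
        (var_x / nx) ** 2 / (nx - 1) + (var_y / ny) ** 2 / (ny - 1)
    )
    return t, df
```

A negative t value, as in the reported result, arises when the first group \(words with a neutral tone in both varieties\) has the smaller mean SSD\.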

![Refer to caption](https://arxiv.org/html/2606.26360v1/x11.png)Figure 9: Distribution of mean squared difference between word type pairs\. The orange box indicates words bearing T5 in both Beijing and Taiwan Mandarin\. The green box indicates words in which the second syllable bears T5 in Beijing Mandarin but not in Taiwan Mandarin\.

In summary, with respect to the research question of whether words have their own tonal signatures not only in Taiwan Mandarin but also in Beijing Mandarin, the evidence unequivocally supports tonal signatures across both dialects\.

### 4\.3 Meaning

In this section, we turn to the fourth and final research question, examining whether the word\-specific tonal signatures observed in the preceding section are linked to the meanings of words in context\. To do so, we calculated meaning vectors for the word tokens, and examined whether these meaning vectors are predictive of words’ pitch contours\.

We operationalized words’ meanings by means of contextualized embeddings, henceforth CEs\. The embeddings in the current study were derived from Qwen\-2\.5, a family of large language models developed by Alibaba \[[43](https://arxiv.org/html/2606.26360#bib.bib50)\]\. The embedding of a word token was computed for the specific context in which that token occurred\. Thus, each word token was associated with an 896\-dimensional vector\. For the Beijing Mandarin dataset, which is annotated with simplified Chinese characters, we generated a set of embeddings using simplified Chinese input\. The embeddings were brought together as the row vectors of a matrix $S_{\text{beijing}}$\. For the Taiwan Mandarin dataset, which is annotated in traditional Chinese characters, we generated two sets of embeddings: $S_{\text{taiwan\_TC}}$, based on the original traditional Chinese annotations, and $S_{\text{taiwan\_SC}}$, obtained by converting traditional characters into simplified Chinese before extracting embeddings\. Each of these three matrices has dimensions $n \times 896$, where $n$ is the number of tokens in the dataset\. $S_{\text{taiwan\_TC}}$ is used as the semantic representation of the Taiwan Mandarin dataset in the subsequent modeling\. $S_{\text{taiwan\_SC}}$ is used only for the visualization in Figure [10](https://arxiv.org/html/2606.26360#S4.F10)\. Since $S_{\text{beijing}}$ and $S_{\text{taiwan\_SC}}$ reside in the same semantic space \(both are calculated for simplified characters\), we can consider these CEs jointly\.

For tracing possible differences in meanings across the two dialects, we reduced the 896\-dimensional semantic space to two dimensions using t\-SNE \[[49](https://arxiv.org/html/2606.26360#bib.bib9)\]\. Figure [10](https://arxiv.org/html/2606.26360#S4.F10) shows the 10 most frequent word types shared by both datasets, labeled with the numbers 1–10\. Tokens from $S_{\text{beijing}}$ are indicated by orange numbers, and tokens from $S_{\text{taiwan\_SC}}$ are indicated by blue numbers\. Convex hulls are used to highlight clusters corresponding to different word types\.

For most words, a clear single cluster is present that contains tokens from both the Beijing and Taiwan corpora, the only exception being 朋友 \(peng2you3, ‘friend’\), which forms two distinct clusters in the upper right quadrant of the t\-SNE plane\. The cluster on the left, close to 1, corresponds to abstract or relational uses, where 朋友 denotes the existence or formation of social ties \(e\.g\., making friends or having friends\)\. The cluster on the right reflects comitative uses, in which 朋友 appears as a co\-participant in social activities \(e\.g\., going out or engaging in activities with friends\)\. For example, in the relational sense, 朋友 occurs in expressions such as 能交上朋友 \(‘to be able to make friends’\) or 因为上学的一些朋友 \(‘friends made through school’\), where the focus is on the establishment or existence of social relationships rather than shared actions\. By contrast, in the comitative sense, 朋友 appears in contexts describing joint activities, such as 周末会专门约朋友出去 \(‘\[one\] would specifically arrange to meet friends on weekends’\) or 跟朋友一起逛街、运动之类的 \(‘going shopping or exercising together with friends’\), where the emphasis is on shared experiences and social interaction\.

It is noteworthy that within several clusters, the tokens of the two dialects are not randomly distributed\. For the clusters of 有的 \(5\) and 除了 \(10\), for instance, the tokens from the Beijing corpus have higher values on the vertical dimension compared to the tokens from the Taiwan corpus, suggesting that there are measurable differences in the senses of these words in the two dialects\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x12.png)

Figure 10: Contextualized embeddings $S_{\text{beijing}}$ and $S_{\text{taiwan\_SC}}$, obtained from a pretrained Chinese Qwen\-2\.5 model, are shown in a two\-dimensional plane obtained with t\-SNE\. Convex hulls \(polygons\) highlight the clusters of the 10 most frequent word types, labeled by the numbers 1–10\. Orange numbers represent tokens in $S_{\text{beijing}}$, and blue numbers represent tokens in $S_{\text{taiwan\_SC}}$\.

Pitch contours were represented as fixed\-length vectors in matrices $C_{\text{beijing}}$ and $C_{\text{taiwan}}$, each of size $n \times 100$\. Each row corresponds to one token, and the 100 columns of a row vector represent the 100 time\-normalized F0 measurements for that token\. These token\-specific pitch vectors were denoised using the best\-fit GAMs described in Section [4](https://arxiv.org/html/2606.26360#S4), which take into account a wide range of factors that co\-determine tonal realization\. Following \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\], we extracted the smooth for normalized time, the word\-specific smooths, and other smooth terms, and added these to obtain denoised word\-specific pitch signatures\.
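
To make the fixed\-length representation concrete, the sketch below resamples F0 tracks of different lengths to 100 points in normalized time and stacks them as the rows of a matrix\. Linear interpolation with `numpy.interp` is an assumption made for this example, as are the function name `time_normalize` and the toy F0 values\. The GAM\-based denoising step is not reproduced here\.

```python
import numpy as np


def time_normalize(times, f0, n_points=100):
    """Resample an F0 track to n_points equally spaced points in normalized time."""
    times = np.asarray(times, dtype=float)
    f0 = np.asarray(f0, dtype=float)
    # Map the token's own time axis onto the interval [0, 1].
    t_norm = (times - times[0]) / (times[-1] - times[0])
    grid = np.linspace(0.0, 1.0, n_points)
    # Linear interpolation (assumed; the interpolation method is not specified).
    return np.interp(grid, t_norm, f0)


# Hypothetical tokens: (time stamps in seconds, F0 values in Hz).
tokens = [
    (np.array([0.00, 0.05, 0.10, 0.15]), np.array([200.0, 210.0, 205.0, 190.0])),
    (np.array([0.00, 0.10, 0.20]), np.array([180.0, 170.0, 175.0])),
]

# One row per token, 100 columns per row.
C = np.vstack([time_normalize(t, f) for t, f in tokens])
```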

For each variety, the dataset was randomly split into a training dataset \(90%\) and a testing dataset \(10%\), with the constraint that the testing dataset contained only tokens of word types that also occurred in the training dataset\. A linear mapping was trained to project contextualized embeddings \($S$\) onto the corresponding pitch contours \($C$\): $S_{\text{beijing}}$ to $C_{\text{beijing}}$ for Beijing Mandarin, and $S_{\text{taiwan\_SC}}$ to $C_{\text{taiwan}}$ for Taiwan Mandarin\. Model performance was evaluated using a nearest\-neighbor classification approach\. For the predicted contour of a token, its closest neighbor was identified using L2 distance\. A prediction was considered correct if the nearest neighbor matched the target word type; otherwise, it was considered incorrect\. The entire procedure was repeated over 30 random permutations, and mean accuracy was calculated\.
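
The procedure can be sketched as follows, under several assumptions that go beyond the text: the mapping is estimated by ordinary least squares without an intercept, the candidate neighbors are the observed contours of the same split, and the data are synthetic stand\-ins\. All function names are hypothetical\.

```python
import numpy as np
from collections import Counter


def split_with_seen_types(types, test_frac=0.1, seed=0):
    """Random split in which every test token's type also occurs in training."""
    rng = np.random.default_rng(seed)
    n_test = int(round(test_frac * len(types)))
    left_in_train = Counter(types)
    test_idx = []
    for i in rng.permutation(len(types)):
        if len(test_idx) == n_test:
            break
        # Move a token to the test set only if its type stays in training.
        if left_in_train[types[i]] > 1:
            left_in_train[types[i]] -= 1
            test_idx.append(int(i))
    test_idx = np.array(sorted(test_idx), dtype=int)
    train_idx = np.setdiff1d(np.arange(len(types)), test_idx)
    return train_idx, test_idx


def fit_linear_map(S, C):
    """Least-squares estimate of F such that S @ F approximates C."""
    F, *_ = np.linalg.lstsq(S, C, rcond=None)
    return F


def nn_accuracy(C_pred, C_obs, types):
    """Share of predicted contours whose L2-nearest observed contour has the same type."""
    diffs = C_pred[:, None, :] - C_obs[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    types = np.asarray(types)
    return float(np.mean(types[nearest] == types))


# Synthetic stand-in data: 5 word types, 8-dim "embeddings", 100-point contours.
rng = np.random.default_rng(1)
n_types, per_type, dim = 5, 40, 8
centers = rng.normal(size=(n_types, dim))
shapes = rng.normal(size=(n_types, 100))
labels = np.repeat(np.arange(n_types), per_type)
types = [f"w{k}" for k in labels]
S = centers[labels] + 0.05 * rng.normal(size=(len(types), dim))
C = shapes[labels] + 0.05 * rng.normal(size=(len(types), 100))

train_idx, test_idx = split_with_seen_types(types, test_frac=0.1, seed=0)
F = fit_linear_map(S[train_idx], C[train_idx])
types_arr = np.asarray(types)
train_acc = nn_accuracy(S[train_idx] @ F, C[train_idx], types_arr[train_idx])
test_acc = nn_accuracy(S[test_idx] @ F, C[test_idx], types_arr[test_idx])
```

In a full replication, the split, fit, and evaluation would be repeated with different seeds and the accuracies averaged\.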

The mean accuracy for Beijing Mandarin was 36\.00% for the training dataset and 23\.97% for the testing dataset \(30\-run global permutation baseline: 10\.01%\)\. For Taiwan Mandarin, the mean accuracy across 30 permutations was 33\.32% for the training dataset and 20\.34% for the testing dataset, with a permutation baseline of 10\.95%\. In both datasets, performance was well above the permutation baseline, indicating that tonal realizations of neutral tone can be predicted by their meaning in context with above\-chance accuracy\.
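
A permutation baseline of this kind could be computed along the following lines\. The sketch assumes that the baseline breaks the pairing between embeddings and contours by shuffling the rows of the embedding matrix before fitting; the exact permutation scheme is not spelled out in the text, so this is one plausible reading\. The data are synthetic and all function names are hypothetical\.

```python
import numpy as np


def nn_accuracy(C_pred, C_obs, types):
    """Share of predicted contours whose L2-nearest observed contour has the same type."""
    diffs = C_pred[:, None, :] - C_obs[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)
    return float(np.mean(types[nearest] == types))


def mapping_accuracy(S, C, types):
    """Fit a least-squares mapping from S to C and score it by nearest neighbor."""
    F, *_ = np.linalg.lstsq(S, C, rcond=None)
    return nn_accuracy(S @ F, C, types)


def permutation_baseline(S, C, types, n_runs=30, seed=0):
    """Mean accuracy after randomly re-pairing embeddings with contours."""
    rng = np.random.default_rng(seed)
    accs = []
    for _ in range(n_runs):
        shuffled = S[rng.permutation(len(S))]  # embeddings no longer match their tokens
        accs.append(mapping_accuracy(shuffled, C, types))
    return float(np.mean(accs))


# Synthetic stand-in data: 5 word types with type-specific embeddings and contours.
rng = np.random.default_rng(1)
labels = np.repeat(np.arange(5), 40)
types = np.array([f"w{k}" for k in labels])
S = rng.normal(size=(5, 8))[labels] + 0.05 * rng.normal(size=(200, 8))
C = rng.normal(size=(5, 100))[labels] + 0.05 * rng.normal(size=(200, 100))

observed_acc = mapping_accuracy(S, C, types)
baseline_acc = permutation_baseline(S, C, types, n_runs=30, seed=0)
```

Comparing `observed_acc` with `baseline_acc` mirrors the comparison between the reported accuracies and their permutation baselines\.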

Figure [11](https://arxiv.org/html/2606.26360#S4.F11) compares observed and predicted pitch contours for the ten words shown in Figure [10](https://arxiv.org/html/2606.26360#S4.F10)\. Predicted contours were obtained by giving the centroid embedding of each word as an input to the linear mapping\. For both Beijing and Taiwan Mandarin, the DLM\-predicted contours generally resemble the observed trajectories, although the degree of resemblance varies across words\. Some words exhibit similar contour shapes across the two varieties, such as 事情 and 有的, whereas others show more cross\-variety differences in their observed realizations \(e\.g\., 孩子\)\. Overall, the results indicate that the embedding\-based representations capture substantial information about the shapes of word\-specific pitch contours\.

![Refer to caption](https://arxiv.org/html/2606.26360v1/x13.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.26360v1/x14.png)\(b\)

Figure 11: Comparison of observed \(dashed lines\) and predicted pitch contours \(solid lines\) for the ten words shown in Figure [10](https://arxiv.org/html/2606.26360#S4.F10)\. The upper panel presents results for Beijing Mandarin, and the lower panel presents results for Taiwan Mandarin\. Predicted contours were obtained by using the centroid of the word as input to the linear mapping\. Observed contours are reproduced from the predicted pitch contours shown in Figure [8](https://arxiv.org/html/2606.26360#S4.F8) and rescaled to the same scale as the DLM predictions\.

In addition, we computed for each tone pattern the centroid of the CEs of the tokens belonging to that tone pattern\. For each dialect separately, the resulting four centroid vectors, the ‘prototypical meanings’ of the tone patterns, were given as input to the linear mapping, resulting in four predicted pitch contours\. Figure [12](https://arxiv.org/html/2606.26360#S4.F12) compares these CE\-predicted contours \(solid lines\) with the partial effect pitch curves obtained with the best\-fit GAMs for tone pattern \(dashed lines, cf\. Figure [6](https://arxiv.org/html/2606.26360#S4.F6)\)\. Two things are noteworthy\. First, there is considerable similarity between the CE\-predicted curves and the GAM\-based partial effect curves\. Given that the CEs are obtained with a large language model that is not tuned to the individual experiences of the speakers in the two corpora, perfect fits cannot be expected\. Second, some of the differences between the tone\-pattern contours of Beijing and Taiwan Mandarin may be due to subtle differences in the meanings\-in\-context of the word tokens in the two dialects, as exemplified in Figure [10](https://arxiv.org/html/2606.26360#S4.F10), especially in the case of the T1\-T5 and T4\-T5 tone patterns, for which the fits of the CE\-predicted contours and the GAM\-based observed contours are very similar within each dialect while being clearly different between dialects\.
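
The centroid\-based prediction can be sketched as follows\. The embedding matrix `S`, the mapping `F`, and the function name `centroid_predictions` are hypothetical placeholders; in the study, the mapping would be the one estimated from the corpus data\.

```python
import numpy as np


def centroid_predictions(S, labels, F):
    """Map the centroid embedding of each label through the linear mapping F."""
    labels = np.asarray(labels)
    names = sorted(set(labels.tolist()))
    # One centroid (mean embedding) per label, stacked as rows.
    centroids = np.vstack([S[labels == name].mean(axis=0) for name in names])
    return names, centroids @ F


# Hypothetical data: 12 tokens with 6-dim embeddings and a random mapping.
rng = np.random.default_rng(0)
S = rng.normal(size=(12, 6))
F = rng.normal(size=(6, 100))
patterns = ["T1-T5", "T2-T5", "T3-T5", "T4-T5"] * 3

names, predicted = centroid_predictions(S, patterns, F)
```

Because the mapping is linear, the contour predicted from a centroid equals the mean of the contours predicted for the individual tokens of that tone pattern\.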

![Refer to caption](https://arxiv.org/html/2606.26360v1/x15.png)\(a\)
![Refer to caption](https://arxiv.org/html/2606.26360v1/x16.png)\(b\)

Figure 12: CE\-predicted pitch contours for tone patterns \(solid lines\) and GAM\-estimated partial effects for tone pattern \(dashed lines\) for Beijing Mandarin \(upper panel\) and Taiwan Mandarin \(lower panel\)\. For ease of comparison, the GAM\-predicted contours are linearly rescaled to match the range of the CE\-predicted contours\.

With respect to our fourth research question, which asks whether contextualized embeddings for Beijing conversational speech are predictive for words’ tonal signatures as previously observed for Taiwan Mandarin, our results provide clear support for the possibility that words’ pitch contours are in part co\-determined by their meanings\. Furthermore, we have shown that the differences in the realization of the tone patterns across the two dialects may in part arise due to dialect\-specific differences in words’ contextualized meanings\.
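
The linear rescaling mentioned in the caption of Figure 12 could be implemented as a min\-max transformation, as in the sketch below\. Matching the minimum and maximum of the reference curve is an assumption about how the range was matched, and the two curves are invented for illustration\.

```python
import numpy as np


def rescale_to_range(curve, reference):
    """Linearly rescale curve so that its minimum and maximum match those of reference."""
    curve = np.asarray(curve, dtype=float)
    reference = np.asarray(reference, dtype=float)
    span = curve.max() - curve.min()
    if span == 0.0:
        # A flat curve has no range to stretch; place it at the reference midpoint.
        return np.full_like(curve, 0.5 * (reference.min() + reference.max()))
    unit = (curve - curve.min()) / span
    return reference.min() + unit * (reference.max() - reference.min())


# Hypothetical curves on different scales.
t = np.linspace(0.0, 1.0, 100)
gam_curve = 0.3 * np.sin(2 * np.pi * t)
ce_curve = 5.0 + 2.0 * np.cos(np.pi * t)

rescaled = rescale_to_range(gam_curve, ce_curve)
```

The transformation changes only location and scale, so the shape of the rescaled contour is preserved\.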

## 5General discussion

The neutral, or floating, tone of Mandarin Chinese is an enigmatic tone that has been described as a reduced tone, as a flexible tone that lies on a continuum from lexical tone to toneless, and as a tone that is highly context\-dependent \[see, e\.g\., [13](https://arxiv.org/html/2606.26360#bib.bib27)\]\. In this study, we investigated the realization of two\-syllable words with a neutral second tone in corpora of conversational Mandarin recorded in Beijing and in Taiwan, focusing on the following four questions\.

1. What is the nature of the neutral/floating tone? For two\-syllable words, the floating tone turns out to be characterized by a mid\-to\-low final tone target, which dovetails well with the findings of \[[28](https://arxiv.org/html/2606.26360#bib.bib36), [24](https://arxiv.org/html/2606.26360#bib.bib34)\]\. At the same time, each tone pattern has its own tonal characteristics\. Therefore, the floating tone in two\-syllable words is not different in principle from other lexical tones in second syllables \[cf\. [38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. One possible interpretation is that the neutral tone is not a single lexical tone but rather a mixture of several distinct underlying tones\. However, this possibility is not unique to the neutral tone and may also apply to other lexical tones\. More importantly, when the neutral tone is treated as a single lexical category, as we do here, it exhibits the same kind of systematic variability as the other lexical tones, with an underlying unifying tonal pattern\. This leads to the conclusion that the neutral \(or floating\) tone is a lexical tone, characterized by a mid\-to\-low final F0 target\.
2. Do tone patterns of words with the floating tone differ between Beijing and Taiwan Mandarin? A comparison of Beijing and Taiwan Mandarin revealed that, in Beijing Mandarin, words with the T1\-T5 and T4\-T5 tone patterns have an initial rise that is absent in Taiwan Mandarin\. In Taiwan Mandarin, words with the T2\-T5 tone pattern are realized with a nearly flat tone contour, which contrasts with solid evidence for a rise in the T2\-T5 words as realized in the Beijing corpus\. Across the two dialects, the realization of the T3\-T5 pattern is most similar, and resembles a drawn\-out dipping tone\. Furthermore, there is substantially more variability in the realization of words with a floating tone in Taiwan Mandarin than in Beijing Mandarin\. The high variability that characterizes Taiwan floating tones is consistent with the findings of, e\.g\., \[[24](https://arxiv.org/html/2606.26360#bib.bib34)\]\.
3. Do words with a floating tone have word\-specific tonal signatures? In line with previous corpus\-based studies of the realization of tones in conversational Mandarin \[[11](https://arxiv.org/html/2606.26360#bib.bib14), [38](https://arxiv.org/html/2606.26360#bib.bib11), [25](https://arxiv.org/html/2606.26360#bib.bib13)\], solid evidence was obtained for word\-specific pitch signatures, replicating \[[38](https://arxiv.org/html/2606.26360#bib.bib11)\] for Taiwan Mandarin, and extending the evidence to another variety, Beijing Mandarin\. The variable importance of word\-specific smooths was even greater for Beijing Mandarin than for Taiwan Mandarin\. Word\-specific tone signatures can be much stronger than the pitch components tied to tone patterns, which helps explain why \[[13](https://arxiv.org/html/2606.26360#bib.bib27)\] concluded that the neutral tone is a very flexible tone shaped by a variety of contextual factors\. Importantly, the present study takes a wide range of contextual factors into account, including the substantial effect of the tones of adjacent syllables in the context\. Nevertheless, thanks to the ability of the generalized additive model to decompose a pitch contour into an additive set of partial contours that are tied to linguistic predictors, it is now possible to isolate in empirical pitch contours not only the components that are due to neighboring tones, position in the sentence, speaker, and probability in context, but also components linked to tone pattern and, importantly, word identity\.
4. Are word\-specific pitch signatures predictable from their meanings in context? Following previous studies that provided evidence that word\-specific pitch signatures are linked to the meanings of words in context \[[11](https://arxiv.org/html/2606.26360#bib.bib14), [38](https://arxiv.org/html/2606.26360#bib.bib11), [25](https://arxiv.org/html/2606.26360#bib.bib13)\], we examined whether words’ contextualized embeddings are predictive of words’ pitch signatures\. A simple linear transformation from contextualized embeddings to pitch contours indeed predicts words’ signatures with an accuracy that far exceeds a permutation baseline, in line with the predictions of the Discriminative Lexicon Model \[[1](https://arxiv.org/html/2606.26360#bib.bib17), [19](https://arxiv.org/html/2606.26360#bib.bib18)\]\. Furthermore, the prototypical meanings of the tone patterns, estimated by the centroids of their contextualized embeddings, predict pitch contour signatures that are remarkably similar to the pitch components isolated by the GAM for the tone patterns \[see also [11](https://arxiv.org/html/2606.26360#bib.bib14), [38](https://arxiv.org/html/2606.26360#bib.bib11)\]\. Similarities are especially striking for the T1\-T5 and T4\-T5 tone patterns, and capture well the differences in the realization of these tone patterns in the two dialects\. This points to the possibility that differences in the realization of tone between the two dialects are in part due to differences in the senses of these words in context\.

The actual tonal realizations in the corpus of conversational Taiwan Mandarin diverge considerably from the Taiwan dictionary norms\. Words that are described as having a lexical tone on the second syllable \(corresponding to a neutral tone in standard Mandarin and Beijing Mandarin\) can have pitch contours in colloquial language use that bear no resemblance to the lexical tones found in the dictionary, while also diverging considerably from their tone patterns as identified by the GAM\. Relative to the conversational Beijing Mandarin sampled by our corpus, there is more cross\-dialect variation for words that are marked as having a neutral tone in Beijing Mandarin but a lexical tone in Taiwan than for words that are described as having a neutral tone in both dialects\.

Two directions for future research seem especially promising\. First, the pitch contour is not the only acoustic correlate of the neutral tone\. Neutral tones, at least in laboratory speech, tend to have shorter duration\. Future work could therefore examine whether differences in meaning are also reflected in duration\. Evidence for the effect of meaning \(using embeddings\) on spoken word duration has been reported for English \[[16](https://arxiv.org/html/2606.26360#bib.bib16)\]\. Second, previous studies have investigated single\-syllable words \[[25](https://arxiv.org/html/2606.26360#bib.bib13)\] and bisyllabic words \[[11](https://arxiv.org/html/2606.26360#bib.bib14), [38](https://arxiv.org/html/2606.26360#bib.bib11)\], but it is currently unclear whether word\-specific pitch signatures are also present for three\-syllable words\. Previous research suggests that the pitch realization of trisyllabic sequences differs from that of disyllabic forms \[[53](https://arxiv.org/html/2606.26360#bib.bib2)\]\. In particular, trisyllabic sequences with two adjacent neutral tones will likely be very interesting but also highly challenging, due to data sparsity in current corpora of spontaneous conversational speech\.

To conclude, the present investigation shows that in colloquial conversational Mandarin as spoken in Beijing and Taiwan, the neutral or floating tone as found in two\-syllable words is highly similar to the lexical tones in many ways\. The neutral tone has its own pitch target, just as do the other tones\. Words with the neutral tone have tone patterns, exactly as observed for the lexical tone\. Words with the neutral tone have their own pitch signatures, which are to some extent predictable from their contextualized embeddings, mirroring what has been found for the lexical tones\. The only way in which neutral tones differ from lexical tones is that neutral tones in di\-syllabic words are restricted to the second syllable\. From a meta\-theoretical perspective, there is one further difference: the neutral tone is the only tone the variability of which has been extensively commented on, even though similar variability is widespread among the lexical tones\.

## Funding

This work was supported by the European Research Council under Grant SUBLIMINAL \(\#101054902\) awarded to R\. Harald Baayen\.

## Declaration

The authors declare no conflicts of interest\.

## Data availability

## Appendix

Table A\.1: Model summary of best\-fit GAM fitted to the dataset of Beijing Mandarin\.

Table A\.2: Model summary of best\-fit GAM fitted to the dataset of Taiwan Mandarin\.

## References

- \[1\] \(2019\) The discriminative lexicon: A unified computational model for the lexicon and lexical processing in comprehension and production grounded not in \(de\)composition but in linear discriminative learning\. Complexity 2019\. External Links: [Document](https://dx.doi.org/10.1155/2019/4895891)\.
- \[2\] P\. Boersma and D\. Weenink \(2020\) Praat: doing phonetics by computer \[Computer program\]\. Version 6\.0\.37, 2018\.
- \[3\] J\. Cao \(1986\) Putonghua qingsheng yinjie texing fenxi\. Applied Acoustics 5\(4\), pp\. 1–6\.
- \[4\] J\. Cao \(1992\) On neutral\-tone syllables in Mandarin Chinese\. Canadian Acoustics 20\(3\), pp\. 49–50\.
- \[5\] Y\. R\. Chao \(1932\) A preliminary study of English intonation \(with American variants\) and its Chinese equivalents\. In The Tsai Yuan P’ei anniversary volume \(supplementary Vol\. I of the Bulletin of the Institute of History and Philology\)\.
- \[6\] Y\. R\. Chao \(1968\) A grammar of spoken Chinese\.
- \[7\] M\. Y\. Chen \(2000\) Tone sandhi: Patterns across Chinese dialects\. Cambridge Studies in Linguistics, Cambridge University Press\. External Links: [Document](https://dx.doi.org/10.1017/CBO9780511486364)\.
- \[8\] Y\. Chen and Y\. Xu \(2006\) Production of weak elements in speech – evidence from F0 patterns of neutral tone in Standard Chinese\. Phonetica 63\(1\), pp\. 47–75\.
- \[9\] Chinese Academy of Social Sciences \(2016\) Xiandai Hanyu Cidian \[The Contemporary Chinese Dictionary\]\. 7th edition, Commercial Press\.
- \[10\] E\. Chirkova and Y\. Chen \(2011\) Beijing Mandarin, the language of Beijing\. HAL Id: hal\-00724219\.
- \[11\] Y\. Chuang, M\. J\. Bell, Y\. Tseng, and R\. H\. Baayen \(2026\) Word\-specific tonal realizations in Mandarin\. Language\. External Links: [Document](https://dx.doi.org/10.1017/S0097850725000001)\.
- \[12\] Y\. Chuang, J\. Fon, I\. Papakyritsis, and H\. Baayen \(2021\) Analyzing phonetic data with generalized additive mixed models\. In Manual of clinical phonetics, pp\. 108–138\.
- \[13\] X\. Dong, F\. Liu, C\. Lin, M\. Nesbitt, and S\. Shi \(2025\) Neutral tone variation in Beijing Mandarin: is neutral tone toneless?\. In Proceedings of Interspeech 2025, pp\. 694–698\.
- \[14\] J\. J\. Dreher and P\. Lee \(1968\) Instrumental investigation of single and paired Mandarin tonemes\. Monumenta Serica 27\(1\), pp\. 343–373\. External Links: [Document](https://dx.doi.org/10.1080/02549948.1968.11731059)\.
- \[15\] J\. Fon \(2004\) A preliminary construction of Taiwan Southern Min spontaneous speech corpus\. Technical Report NSC\-92\-2411\-H\-003\-050, National Science Council, Taiwan\.
- \[16\] S\. Gahl and R\. H\. Baayen \(2024\) *Time* and *thyme* again: connecting spoken word duration in English to models of the mental lexicon\. Language 100\(4\), pp\. 623–670\. External Links: [Document](https://dx.doi.org/10.1353/lan.2024.a947037)\.
- \[17\] J\. Goldman \(2011\) EasyAlign: an automatic phonetic alignment tool under Praat\. In Interspeech, Vol\. 12, pp\. 3233–3236\.
- \[18\] M\. Grootendorst \(2022\) BERTopic: neural topic modeling with a class\-based tf\-idf procedure\. arXiv preprint arXiv:2203\.05794\.
- \[19\] M\. Heitmeier, Y\. Chuang, and R\. H\. Baayen \(2026\) The discriminative lexicon: theory, implementation in the Julia package JudiLing, and applications\. Cambridge University Press\.
- \[20\] F\. Hsieh and C\. Chuang \(2008\) A study of the phonetics and phonology of neutral tones in Urumqi Chinese\. USTWPL 4, pp\. 57–71\.
- \[21\] H\.\-C\. Hsu \(2006\) Revisiting tone and prominence in Chinese\. Language and Linguistics 7\(1\), pp\. 109–1137\.
- \[22\] M\. Hu \(1987\) Beijinghua chutan\. A preliminary study of the Peking dialect\. Commercial Press, Beijing\.
- \[23\] J\. Huang and A\. Li \(2023\-07\) 轻声与非轻声之间轻重的连续统关系\. External Links: [Document](https://dx.doi.org/10.13724/j.cnki.ctiw.2023.03.008)\.
- \[24\] K\. Huang \(2018\) Phonological identity of the neutral\-tone syllables in Taiwan Mandarin: an acoustic study\. Acta Linguistica Asiatica 8\(2\), pp\. 9–50\.
- \[25\] X\. Jin, M\. Ernestus, and R\. H\. Baayen \(2026\) A new kid on the block: Distributional semantics predicts the word\-specific tone signatures of monosyllabic words in conversational Taiwan Mandarin speech\. Journal of Phonetics\. External Links: [Document](https://dx.doi.org/10.1016/j.wocn.2026.101495)\.
- \[26\] C\. C\. Kubler \(1985\) The influence of Southern Min on the Mandarin of Taiwan\. Anthropological Linguistics 27\(2\), pp\. 156–176\.
- \[27\] W\. Lee and Z\. Eric \(2008\) Prosodic characteristics of the neutral tone in Beijing Mandarin / 北京话轻声的韵律特征\. Journal of Chinese Linguistics, pp\. 1–29\. External Links: [Link](https://www.jstor.org/stable/23754104)\.
- \[28\] W\. Lee \(2003\) A phonetic study of the neutral tone in Beijing Mandarin\. In Proceedings of the 15th International Congress of Phonetic Sciences \(ICPHS 2003\), Barcelona\.
- \[29\] J\. Li \(2005\) The preliminary study about neutral tone: dialect effect between north official Mandarin speakers in China and Taiwan Mandarin speakers\. The Journal of the Acoustical Society of America 117\(4\_Supplement\), pp\. 2457–2457\. External Links: [Document](https://dx.doi.org/10.1121/1.4787192)\.
- \[30\] Q\. Li and Y\. Chen \(2019\) Prosodically conditioned neutral\-tone realization in Tianjin Mandarin\. Journal of East Asian Linguistics 28\(3\), pp\. 211–242\.
- \[31\] Y\. Li and Z\. Wu \(2018\) Disyllabic tone sandhi and neutral tone patterns in Yichang dialect\. In 2018 International Conference on Asian Language Processing \(IALP\), pp\. 1–7\.
- \[32\] Y\. Li and A\. L\. Thompson \(2016\) Production of neutral tones in three Mandarin dialects\. Journal of the Acoustical Society of America 140\(4\_Supplement\), pp\. 3397–3397\.
- \[33\] Y\. Li, C\. T\. Best, M\. D\. Tyler, and D\. Burnham \(2020\) Tone variations in regionally accented Mandarin\. In Interspeech, pp\. 4158–4162\.
- \[34\] M\. Lin and J\. Yan \(1980\) Beijinghua qingsheng de shengxue xingzhi \(the acoustic features of the neutral tone in Beijing dialect\) 北京话轻声的声学性质\. Fangyan 3, pp\. 166–178\.
- \[35\] F\. Liu and Y\. Xu \(2007\) The neutral tone in question intonation in Mandarin\. In Interspeech 2007, pp\. 630–633\.
- \[36\] Y\. Lu \(1995\) Putonghua de qingsheng he erhua \[neutral tone and rhotacization in Standard Mandarin\]\. Shangwu Yinshuguan\.
- \[37\] Y\. Lu, Y\. Chuang, and R\. H\. Baayen \(2026\) Form and meaning co\-determine the realization of tone in Taiwan Mandarin spontaneous speech: the case of T2\-T3 and T3\-T3 tone sandhi\. To appear in Journal of Chinese Linguistics\. External Links: [Document](https://dx.doi.org/10.48550/arXiv.2408.15747)\.
- \[38\] Y\. Lu, Y\. Chuang, and R\. H\. Baayen \(2026\) The realization of tones in spontaneous spoken Taiwan Mandarin: a corpus\-based survey and theory\-driven computational modeling\. Corpus Linguistics and Linguistic Theory\. External Links: [Document](https://dx.doi.org/10.1515/cllt-2025-0028)\.
- \[39\] C\. Luo and J\. Wang \(1957\) Putong yuyinxue gangyao 普通語音學綱要 \[outline of general phonetics\] \(revised ed\.\)\. Beijing: Kexue Chubanshe\. \(Reprinted and partially revised in 2002 by Beijing: Shangwu Yinshuguan\)\.
- \[40\] W\. Ma and K\. Chen \(2003\) Introduction to CKIP Chinese word segmentation system for the first international Chinese word segmentation bakeoff\. In Proceedings of the second SIGHAN workshop on Chinese language processing, pp\. 168–171\.
- \[41\] M\. McAuliffe, M\. Socolof, S\. Mihuc, M\. Wagner, and M\. Sonderegger \(2017\) Montreal Forced Aligner: trainable text\-speech alignment using Kaldi\. In Interspeech, Vol\. 2017, pp\. 498–502\.
- \[42\] X\. Qian \(1985\) 论普通话的轻声词和儿化词 \[on neutral tone words and erhua in Mandarin\]\. 深圳大学学报: 人文社会科学版 \(3\), pp\. 74–82\.
- \[43\] Qwen \(2025\) Qwen2\.5 technical report\. External Links: arXiv:2412\.15115, [Link](https://arxiv.org/abs/2412.15115)\.
- \[44\] F\. Ruan, Q\. Song, K\. Li, and Y\. Hao \(2018\) Definition of Corpus, scripts, standards and Specifications of environment/speaker coverage for Mandarin languages\. Technical report, Beijing Haitian Ruisheng Science Technology Ltd\.
- \[45\] C\. Shih \(1987\) The phonetics of the Chinese tonal system\. AT&T Bell Labs technical memo\.
- \[46\] Y\. Sun and C\. Shih \(2021\) Boundary\-conditioned anticipatory tonal coarticulation in Standard Mandarin\. Journal of Phonetics 84, pp\. 101018\. External Links: [Document](https://dx.doi.org/10.1016/j.wocn.2020.101018)\.
- \[47\] R Core Team \(2020\) R: a language and environment for statistical computing\. R Foundation for Statistical Computing\.
- \[48\] C\. Tseng \(2004\) Prosodic properties of intonation in two major varieties of Mandarin Chinese: Mainland China vs\. Taiwan\. In International Symposium on Tonal Aspects of Languages: With Emphasis on Tone Languages, pp\. 28–31\.
- \[49\] L\. Van der Maaten and G\. Hinton \(2008\) Visualizing data using t\-SNE\. Journal of Machine Learning Research 9\(11\)\.
- \[50\] Y\. Wei \(2011\) Wulumuqi hanyu fangyan qingsheng de yuyin xingzhi ji youxuanlun fenxi \(an acoustic and OT analysis of the neutral tone in Ürümqi Mandarin\) 乌鲁木齐汉语方言轻声的语音性质及优选论分析\. Journal of Southwest Agricultural University \(Social Science Edition\) 西南农业大学学报\(社会科学版\) 9\(1\)\.
- \[51\] S\. N\. Wood \(2017\) Generalized additive models: an introduction with R\. CRC Press\.
- \[52\] Y\. Wu, M\. Adda\-Decker, and L\. Lamel \(2023\) Mandarin lexical tone duration: impact of speech style, word length, syllable position and prosodic position\. Speech Communication 146, pp\. 45–52\.
- \[53\] C\. Xu \(2024\) Cross\-dialectal perspectives on Mandarin neutral tone\. Journal of Phonetics 106, pp\. 101341\. External Links: [Document](https://dx.doi.org/10.1016/j.wocn.2024.101341)\.
- \[54\] C\. Xu \(2025\) Plastic Mandarin tones: regional identity in prosody\. Phonetica 82\(5\), pp\. 331–362\. External Links: [Document](https://dx.doi.org/10.1515/phon-2025-0001)\.
- \[55\] Y\. Xu \(1997\) Contextual tonal variations in Mandarin\. Journal of Phonetics 25\(1\), pp\. 61–83\.
- \[56\] F\. Yan \(2024\) On word stress in Mandarin: evidence from the differences between qingsheng 轻声 and qingyin 轻音\. International Journal of Chinese Linguistics 11\(2\), pp\. 190–246\. External Links: [Document](https://dx.doi.org/10.1075/ijchl.00015.yan)\.
- \[57\] M\. Yip \(2006\) Tone: phonology\. Elsevier\.
- \[58\] Q\. Zhang \(2005\) A Chinese yuppie in Beijing: phonological variation and the construction of a new professional identity\. Language in Society 34\(3\), pp\. 431–466\.
- \[59\] Z\. Zhang and F\. Hu \(2020\) Neutral tone in Changde Mandarin\. In INTERSPEECH, pp\. 1923–1927\.