Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

Summary

This paper proposes a reference-based evaluation protocol for assessing prosody and rhythm in speech-to-speech AI systems, using matched human conversation data to provide interpretable behavioral plausibility checks.

arXiv:2606.31055v1 Announce Type: new Abstract: Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th-95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.

# Reference-Based Prosody and Rhythm Evaluation for Spoken Dialogue Systems

This work was supported by the SPEAR project.

Source: [https://arxiv.org/html/2606.31055](https://arxiv.org/html/2606.31055)
###### Abstract

Speech-to-speech (S2S) AI agents are advancing rapidly, yet evaluation lacks interpretable speech-native measures for conversational prosody and rhythm. Because $F_0$, speaking rate, articulation rate, and pausing shift with model-predicted speaker traits and interaction state, pooled human statistics can be poorly calibrated for evaluating a particular output. Using 4000+ hours of dyadic English conversation from the Seamless Interaction dataset, we construct matched reference regimes for $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. We then define a percentile-based evaluation protocol: extract the same metrics from an S2S output waveform, compare them to the closest matched human reference stratum, and report percentile deviations or 5th–95th percentile out-of-regime flags. On held-out human rows, pooled references over-flag state-conditioned $F_0$ expressivity and rhythm, while matched references return flag rates closer to the nominal 10% and make deviation direction interpretable. These outputs serve as behavioral plausibility checks that complement, rather than replace, perceptual and user-centered evaluation.

## I. Introduction

Spoken conversation is a highly coordinated form of communication, where interlocutors manage turn\-taking while continuously adapting prosody and timing to the unfolding interaction\[[32](https://arxiv.org/html/2606.31055#bib.bib57),[22](https://arxiv.org/html/2606.31055#bib.bib35)\]\. Despite this complexity, transitions are typically smooth and supported by systematic cues, including prosodic and syntactic structure\[[20](https://arxiv.org/html/2606.31055#bib.bib42)\]\. Speech technology has progressed toward S2S AI agents capable of producing fluent spoken responses\. Evaluation, however, is still dominated by text\-based measures, task success, or subjective ratings, which do not directly quantify speech\-native interactional behavior\. Systems trained primarily on read speech often fail to reproduce properties characteristic of spontaneous dialogue\[[28](https://arxiv.org/html/2606.31055#bib.bib39)\]\.

Conversational speech differs from read speech across multiple dimensions, including prosodic range and stress realization\[[16](https://arxiv.org/html/2606.31055#bib.bib17),[15](https://arxiv.org/html/2606.31055#bib.bib18)\]\. Temporal organization varies with context and listener demands, shaping speaking rate and rhythmic structure\[[36](https://arxiv.org/html/2606.31055#bib.bib13),[42](https://arxiv.org/html/2606.31055#bib.bib24),[8](https://arxiv.org/html/2606.31055#bib.bib14),[44](https://arxiv.org/html/2606.31055#bib.bib25)\]\. These patterns align with accounts of dialogue as adaptive coordination, where speakers modify behavior to meet various contextual demands\[[10](https://arxiv.org/html/2606.31055#bib.bib40),[43](https://arxiv.org/html/2606.31055#bib.bib37)\]\.

Prosodic variation, particularly fundamental frequency ($F_0$), encodes communicative intent, emotion, and discourse structure, and interacts with prominence and stress cues\[[4](https://arxiv.org/html/2606.31055#bib.bib41),[29](https://arxiv.org/html/2606.31055#bib.bib32),[45](https://arxiv.org/html/2606.31055#bib.bib31),[33](https://arxiv.org/html/2606.31055#bib.bib27),[34](https://arxiv.org/html/2606.31055#bib.bib28)\]. Temporal structure is central to conversational timing, with systematic patterns in pausing and speaking rate\[[13](https://arxiv.org/html/2606.31055#bib.bib15),[31](https://arxiv.org/html/2606.31055#bib.bib22),[30](https://arxiv.org/html/2606.31055#bib.bib23),[19](https://arxiv.org/html/2606.31055#bib.bib16),[24](https://arxiv.org/html/2606.31055#bib.bib20),[25](https://arxiv.org/html/2606.31055#bib.bib21)\]. Recent work also shows that conversational speech behaviors are context dependent, motivating analyses that explicitly condition on interaction factors\[[43](https://arxiv.org/html/2606.31055#bib.bib37)\].

Prior work has often examined prosody and timing in isolation or at a limited scale, leaving a gap in large\-scale, multidimensional reference regimes for conversational speech technology\. This gap is limiting for spoken dialogue evaluation, where a system may fall inside a pooled range while still producing prosody or timing that is implausible for the relevant speaker profile or interactional state\.

We frame large-scale conversational distributions as an operational evaluation resource rather than as descriptive baselines alone. Using robust percentile-based $F_0$ measures and rhythm metrics from over 4,000 hours of English dyadic conversation, we make four contributions. First, we construct matched reference regimes for conversational $F_0$ mean, $F_0$ expressivity, speech rate, articulation rate, pause ratio, and mean pause duration. Second, we show that pooled references can obscure systematic shifts associated with model-predicted sex label, model-predicted age bin, arousal, and dominance\[[43](https://arxiv.org/html/2606.31055#bib.bib37),[40](https://arxiv.org/html/2606.31055#bib.bib29)\]. Third, we define a simple evaluation procedure that extracts the same metrics from an S2S output waveform and reports percentile deviations or out-of-regime flags relative to the closest matched human reference stratum. Fourth, we provide reproducible reference tables and an extraction/comparison protocol for speech-native behavioral plausibility checks in spoken dialogue systems.

## II. Methods

### II-A Dataset

We analyze dyadic conversational speech from the Seamless Interaction dataset, a large-scale corpus of face-to-face interactions designed to capture both Naturalistic interactions with untrained participants and Improvised interactions with trained actors. The released dataset contains 4,065.04 hours of interaction time, comprising 64,739 interactions segmented from 5,098 one-hour recording sessions involving 4,284 participants\[[1](https://arxiv.org/html/2606.31055#bib.bib1)\]. We use “interaction” to refer to a single conversational segment within a session. Each interaction yields two speaker channels, and all metrics are computed at the channel level. The mix of Naturalistic and Improvised material supports analyses that condition on interaction factors\[[10](https://arxiv.org/html/2606.31055#bib.bib40),[43](https://arxiv.org/html/2606.31055#bib.bib37)\].

### II-B Prosodic Metrics

Fundamental frequency ($F_0$) is defined only during voiced speech, so each audio signal is analyzed frame-by-frame using the autocorrelation-based $F_0$ estimator in Praat via parselmouth\[[18](https://arxiv.org/html/2606.31055#bib.bib11)\]. $F_0$ extraction uses conservative bounds (75–500 Hz) to cover typical adult conversational ranges while reducing octave errors and spurious tracks\[[41](https://arxiv.org/html/2606.31055#bib.bib3)\]. We define the voiced ratio as the proportion of frames assigned a nonzero $F_0$ value and exclude speaker-channels with a voiced ratio $< 0.05$\[[38](https://arxiv.org/html/2606.31055#bib.bib34)\]. To reduce sensitivity to spontaneous-speech artifacts and $F_0$-tracking outliers, we compute percentile-trimmed summaries by retaining voiced $F_0$ values between the 10th and 90th percentiles within each speaker-channel\[[26](https://arxiv.org/html/2606.31055#bib.bib7)\]. We report the 10–90% trimmed mean, standard deviation, and range of $F_0$.
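
As a rough illustration of the trimmed summaries described above, the following Python sketch computes the voiced ratio and the 10–90% trimmed mean, SD, and range from a frame-level $F_0$ track. It assumes the track has already been extracted (for example with Praat via parselmouth) and that unvoiced frames are coded as 0. The function names, the linearly interpolated percentile, the population SD, and the definition of range as the max minus the min of the retained values are illustrative assumptions, not details confirmed by the paper.

```python
import math
import statistics


def percentile(sorted_vals, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_vals:
        raise ValueError("percentile of empty input")
    pos = (len(sorted_vals) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def trimmed_f0_summary(f0_frames, min_voiced_ratio=0.05):
    """Summarize one speaker-channel F0 track (Hz; 0 marks unvoiced frames).

    Returns None when the channel fails the voiced-ratio criterion.
    """
    if not f0_frames:
        return None
    voiced = sorted(v for v in f0_frames if v > 0)
    voiced_ratio = len(voiced) / len(f0_frames)
    if voiced_ratio < min_voiced_ratio:
        return None
    # Keep only voiced values between the 10th and 90th percentiles.
    p10, p90 = percentile(voiced, 10), percentile(voiced, 90)
    kept = [v for v in voiced if p10 <= v <= p90]
    return {
        "voiced_ratio": voiced_ratio,
        "f0_mean": statistics.fmean(kept),
        "f0_sd": statistics.pstdev(kept),   # population SD (assumption)
        "f0_range": max(kept) - min(kept),  # range of retained values (assumption)
    }
```

Under these assumptions the range is effectively the 10th-to-90th percentile span, which is what makes it robust to isolated octave errors.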

### II-C Temporal Metrics

Temporal metrics are computed from word-level timestamps in the dataset’s ASR-aligned transcripts and the Voice Activity Detection (VAD) segments distributed with the corpus\[[1](https://arxiv.org/html/2606.31055#bib.bib1)\]. To obtain stable long-term estimates of rate, we follow evidence that speaking-rate estimates stabilize after 12.1 s on average (most values between 7.9 and 16.2 s)\[[2](https://arxiv.org/html/2606.31055#bib.bib2)\]. Accordingly, we define speech-activity stretches using the provided VAD, merge adjacent segments separated by at most 1.0 s, and retain only continuous stretches with duration $\geq 12.1$ s before computing speaker-channel-level temporal statistics. We define pauses as inter-word gaps $\geq 0.2$ s, a common practical threshold because shorter silences are difficult to distinguish from stop closures and their inclusion increases annotation and measurement burden\[[6](https://arxiv.org/html/2606.31055#bib.bib38)\]. Let $W$ be the number of retained words, $T$ the total duration (sum of retained stretch durations), and $P$ the total pause time (sum of inter-word gaps $\geq 0.2$ s within stretches). We report rates in words per minute (WPM) because WPM is directly measurable from the word-aligned timestamps in the dataset and, in semi-spontaneous speech, correlates very strongly with naïve listeners’ tempo ratings\[[17](https://arxiv.org/html/2606.31055#bib.bib49)\]. We report speech rate $= 60 \cdot \frac{W}{T}$ and articulation rate $= 60 \cdot \frac{W}{T - P}$\[[12](https://arxiv.org/html/2606.31055#bib.bib56)\] in WPM, along with pause ratio $= \frac{P}{T}$.
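
The definitions of $W$, $T$, and $P$ above can be sketched in a few lines of Python. This is a minimal illustration, assuming that VAD segments have already been merged into stretches and that each stretch carries its own time-ordered word timestamps. The input layout and the function name are hypothetical, and mean pause duration is taken here as total pause time divided by the number of pauses, which the text does not spell out.

```python
def temporal_metrics(stretches, min_pause=0.2, min_stretch=12.1):
    """Compute rate and pause metrics over retained speech stretches.

    `stretches` is a list of (stretch_start, stretch_end, words), where
    `words` is a time-ordered list of (word_start, word_end) in seconds.
    """
    n_words = 0        # W: number of retained words
    total_time = 0.0   # T: summed duration of retained stretches
    pauses = []        # inter-word gaps >= min_pause within stretches
    for start, end, words in stretches:
        if end - start < min_stretch:
            continue  # too short for a stable long-term rate estimate
        total_time += end - start
        n_words += len(words)
        for (_, prev_end), (next_start, _) in zip(words, words[1:]):
            gap = next_start - prev_end
            if gap >= min_pause:
                pauses.append(gap)
    pause_time = sum(pauses)  # P: total pause time
    if n_words == 0 or total_time <= pause_time:
        raise ValueError("no usable speech in the retained stretches")
    return {
        "speech_rate_wpm": 60.0 * n_words / total_time,
        "articulation_rate_wpm": 60.0 * n_words / (total_time - pause_time),
        "pause_ratio": pause_time / total_time,
        "mean_pause_s": pause_time / len(pauses) if pauses else 0.0,
    }
```

Because articulation rate divides by $T - P$ rather than $T$, it is never lower than speech rate for the same channel, which matches the pooled medians in Table I.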

TABLE I: Pooled operating regimes (speaker-channel level). For each track, $N$ and interaction-hours are constant across metrics and are reported in the header row.

| Metric | Median | IQR (25–75%) | Mean |
|---|---|---|---|
| **Prosody** ($N = 121{,}813$, 3,863 h) | | | |
| $F_0$ Mean [Hz] | 157.4 | 120.1–198.6 | 161.5 |
| $F_0$ SD [Hz] | 20.84 | 13.79–30.07 | 23.22 |
| $F_0$ Range [Hz] | 87.11 | 57.11–125.8 | 95.82 |
| **Temporal** ($N = 91{,}471$, 3,045 h) | | | |
| Speech rate [wpm] | 175.9 | 156.0–195.9 | 175.8 |
| Articulation rate [wpm] | 237.8 | 216.1–259.5 | 237.2 |
| Pause ratio | 0.2575 | 0.2166–0.2996 | 0.2595 |
| Mean pause duration [s] | 0.5845 | 0.5225–0.6559 | 0.6058 |

### II-D Speaker Trait Annotations (Vox-Profile)

Since Seamless Interaction does not include ground-truth speaker-trait metadata, we augment its metadata with model-conditioned speaker-trait and interaction-state annotations using Vox-Profile, a benchmark and toolchain for characterizing static and dynamic speech traits using speech foundation models\[[9](https://arxiv.org/html/2606.31055#bib.bib55)\]. For each speaker-channel waveform, we first resample the audio to 16 kHz and apply Silero VAD\[[37](https://arxiv.org/html/2606.31055#bib.bib50)\] to extract speech-only material. We then run two WavLM-based predictors released with Vox-Profile: a multitask age/sex model and a dimensional emotion model. The age/sex model outputs an age estimate (mapped to model-predicted age bins for analysis) and a binary sex prediction with an associated posterior probability, while the emotion model outputs continuous arousal, valence, and dominance scores in $[0, 1]$. Vox-Profile reports high performance for sex classification (97.7% acc., macro-F1 0.971) and moderate performance for age-bin prediction (67.6% acc., macro-F1 0.624)\[[9](https://arxiv.org/html/2606.31055#bib.bib55)\].

We focus on model-predicted sex label, model-predicted age bin, arousal, and dominance as stratification variables because they capture complementary, operationally relevant sources of variation for conversational prosody and rhythm. Model-predicted sex label is expected to strongly track habitual $F_0$-related statistics associated with anatomical and physiological voice differences\[[40](https://arxiv.org/html/2606.31055#bib.bib29),[38](https://arxiv.org/html/2606.31055#bib.bib34)\]. Chronological age has been linked to systematic differences in conversational timing and fluency-related measures such as speaking rate and pausing, motivating the use of model-predicted age bin as an operational stratifier\[[36](https://arxiv.org/html/2606.31055#bib.bib13),[13](https://arxiv.org/html/2606.31055#bib.bib15),[14](https://arxiv.org/html/2606.31055#bib.bib52)\]. Arousal and dominance provide continuous proxies for interactional state that modulate prosodic expressivity and temporal pacing in natural speech\[[4](https://arxiv.org/html/2606.31055#bib.bib41),[29](https://arxiv.org/html/2606.31055#bib.bib32),[3](https://arxiv.org/html/2606.31055#bib.bib53)\]. In preliminary screening across available model-conditioned speaker traits and interaction-state variables, these four factors produced the most consistent and interpretable shifts in the prosodic and temporal distributions studied here, and we therefore center them in a compact, evaluation-oriented analysis.

### II-E Pooled vs. Matched Evaluation Check

To test whether matched regimes improve evaluation calibration, we split speaker-channel rows into deterministic participant-held-out calibration and evaluation sets. For each metric, we estimate 5th–95th percentile thresholds from the calibration split under two references: a pooled reference using all usable calibration rows, and a matched reference using the relevant stratum. We match mean $F_0$ by model-predicted sex label and match $F_0$ expressivity and temporal metrics by arousal or dominance sextile. We then report the held-out evaluation percentage falling outside each interval; a calibrated 5th–95th percentile reference should flag approximately 10% of human conversational rows.
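
The calibration check can be sketched as follows. This is a toy Python illustration under stated assumptions: rows have already been split into calibration and evaluation sets and grouped by stratum label, percentiles use linear interpolation, and all function names are hypothetical rather than taken from the paper's released protocol.

```python
import math


def percentile(values, q):
    """Linearly interpolated percentile (q in [0, 100])."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo, hi = math.floor(pos), math.ceil(pos)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def interval(values, lo_q=5, hi_q=95):
    """Reference interval between two percentiles of a calibration sample."""
    return percentile(values, lo_q), percentile(values, hi_q)


def flag_rate(eval_values, bounds):
    """Percentage of evaluation rows falling outside the interval."""
    lo, hi = bounds
    flagged = sum(1 for v in eval_values if v < lo or v > hi)
    return 100.0 * flagged / len(eval_values)


def pooled_vs_matched(calib, evals):
    """calib/evals: dict mapping stratum label -> list of metric values."""
    pooled_bounds = interval([v for vals in calib.values() for v in vals])
    report = {}
    for stratum, values in evals.items():
        report[stratum] = {
            "pooled": flag_rate(values, pooled_bounds),
            "matched": flag_rate(values, interval(calib[stratum])),
        }
    return report
```

With one narrow and one wide stratum, the pooled interval tends to over-flag the wide stratum and under-flag the narrow one, while each matched interval stays near the nominal 10%.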

## III. Results

TABLE II: Model-predicted sex-label effects across prosodic and temporal metrics. Values are computed on the usable prosody subset (Fig. 1; $N = 121{,}813$) and temporal subset (Fig. 4; $N = 91{,}471$, 3,045 h). Effect size is Cliff’s $\delta$ (Male vs. Female; Mann–Whitney $U$).

| Metric (Cliff’s $\delta$) | Label | Median | Mean | Unit |
|---|---|---|---|---|
| **Prosody (Fig. 1)** | | | | |
| Mean $F_0$ (10–90%) ($\delta = -0.957$) | Male | 121.9 | 125.7 | Hz |
| | Female | 200.7 | 202.7 | Hz |
| $F_0$ SD (10–90%) ($\delta = -0.635$) | Male | 15.00 | 17.82 | Hz |
| | Female | 27.81 | 29.44 | Hz |
| $F_0$ range (10–90%) ($\delta = -0.644$) | Male | 61.95 | 73.44 | Hz |
| | Female | 116.8 | 121.6 | Hz |
| **Temporal (Fig. 4)** | | | | |
| Speech rate ($\delta = 0.066$) | Male | 177.64 | 177.50 | wpm |
| | Female | 173.95 | 173.94 | wpm |
| Articulation rate ($\delta = 0.180$) | Male | 242.94 | 242.02 | wpm |
| | Female | 232.55 | 231.73 | wpm |
| Pause ratio ($\delta = 0.155$) | Male | 0.2656 | 0.2673 | – |
| | Female | 0.2483 | 0.2505 | – |
| Mean pause duration ($\delta = 0.118$) | Male | 0.594 | 0.616 | s |
| | Female | 0.573 | 0.594 | s |

TABLE III: State and model-predicted age-bin effects on prosodic expressivity and conversational timing. Effects are Spearman $\rho$ for arousal/dominance and Kruskal–Wallis $\epsilon^2$ for model-predicted age bins. All reported effects are statistically significant (two-sided; $p \approx 0$ at this scale). Usable subsets match Figs. 2–4 (prosody/state: $N = 121{,}813$; temporal/state: $N = 91{,}471$) and Fig. 5 (age-bin: prosody $N = 118{,}033$; temporal $N = 88{,}287$).

TABLE IV: Naturalistic vs. Improvised subset comparison. Medians are in Hz for $F_0$, wpm for rates, seconds for mean pause duration, and unitless for pause ratio. Effect size is Cliff’s $\delta$ (Naturalistic vs. Improvised).

TABLE V: Held-out out-of-regime flag rates under pooled and matched 5th–95th percentile references. A calibrated reference should flag about 10% of human rows. Matched references use model-predicted sex label for mean $F_0$ and state sextiles for expressivity/rhythm.

![Refer to caption](https://arxiv.org/html/2606.31055v1/x1.png)
Figure 1: Prosodic operating regimes (10–90% trimmed) stratified by model-predicted sex label. Kernel density curves (scaled to counts: $N = 121{,}813$; Male $= 65{,}214$, Female $= 56{,}599$) are shown for $F_0$ mean, SD, and range. Vertical dashed lines mark group medians.

![Refer to caption](https://arxiv.org/html/2606.31055v1/x2.png)
Figure 2: State-driven $F_0$ expressivity across arousal sextiles (S1–S6). Boxplots summarize 10–90% trimmed $F_0$ standard deviation and $F_0$ range within arousal bins ($N = 121{,}813$).

![Refer to caption](https://arxiv.org/html/2606.31055v1/x3.png)
Figure 3: $F_0$ expressivity across dominance sextiles (S1–S6). Boxplots summarize 10–90% trimmed $F_0$ standard deviation and $F_0$ range within dominance bins ($N = 121{,}813$).

We summarize the operating regimes of conversational prosody and rhythm and quantify how these regimes vary with speaker and interaction factors. Throughout, $N$ denotes the number of usable speaker-channel samples satisfying the subset criteria for a given figure/table, and hours denote total speech duration aggregated over those usable samples (reported as interaction-time). We use the term “expressivity” to jointly refer to the range and standard deviation of $F_0$. Table [I](https://arxiv.org/html/2606.31055#S2.SS3) reports pooled reference ranges, Table [II](https://arxiv.org/html/2606.31055#S3) summarizes two-group model-predicted sex-label effects, Table [III](https://arxiv.org/html/2606.31055#S3) summarizes continuous state (arousal, dominance) and model-predicted age-bin effects, and Table [IV](https://arxiv.org/html/2606.31055#S3.T4) checks whether the Naturalistic and Improvised portions of Seamless Interaction require separate treatment. Table [V](https://arxiv.org/html/2606.31055#S3.T5) gives the evaluation consequence of these shifts: pooled references substantially over-flag low- and high-state $F_0$ expressivity, while matched references return held-out human rows close to the nominal 10% flag rate. Since the corpus is large, we report effect sizes in addition to significance tests: for two-group comparisons we use the Mann–Whitney $U$ test with Cliff’s $\delta$ to quantify distributional separation without assuming normality\[[7](https://arxiv.org/html/2606.31055#bib.bib4)\]; for continuous covariates (arousal, dominance) we use Spearman rank correlation $\rho$ to capture monotonic associations robustly\[[35](https://arxiv.org/html/2606.31055#bib.bib44)\]; and for age-bin comparisons we use the Kruskal–Wallis test with $\epsilon^2$ as an effect size for multi-group differences\[[21](https://arxiv.org/html/2606.31055#bib.bib45),[39](https://arxiv.org/html/2606.31055#bib.bib54)\]. The subset comparison shows negligible temporal differences and only small $F_0$ differences (maximum $|\delta| = 0.278$), which are much weaker than the dominant speaker- and state-conditioned shifts; we therefore pool these subsets for the main reference characterization.
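
For readers less familiar with Cliff’s $\delta$, the statistic is the probability that a value from the first group exceeds a value from the second, minus the probability of the reverse. The Python sketch below is one straightforward way to compute it using sorting and binary search; the function name is hypothetical and this is not claimed to be the implementation used for the reported values.

```python
from bisect import bisect_left, bisect_right


def cliffs_delta(xs, ys):
    """Cliff's delta: P(x > y) - P(x < y) over all cross-group pairs."""
    if not xs or not ys:
        raise ValueError("both groups must be non-empty")
    ys_sorted = sorted(ys)
    n_y = len(ys_sorted)
    greater = less = 0
    for x in xs:
        greater += bisect_left(ys_sorted, x)       # number of y strictly below x
        less += n_y - bisect_right(ys_sorted, x)   # number of y strictly above x
    return (greater - less) / (len(xs) * n_y)
```

With the Male vs. Female ordering used in Table II, a value near $-1$ means that almost every male row lies below almost every female row on that metric, while a value near 0 means the two distributions largely overlap.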

### III-A Prosodic Regimes

We characterize conversational prosody using robust 10–90% trimmed $F_0$ statistics that summarize $F_0$ mean, SD, and range. Figure [1](https://arxiv.org/html/2606.31055#S3.F1) shows that the mean $F_0$ is strongly conditioned by model-predicted sex label, with large distributional separation that makes pooled absolute-$F_0$ targets inappropriate for evaluation. The held-out check in Table [V](https://arxiv.org/html/2606.31055#S3.T5) shows that pooled mean-$F_0$ flags are also directionally biased: male rows are flagged almost entirely below the pooled interval (9.8% low vs. 0.1% high), while female rows are flagged above it (0.0% low vs. 8.6% high). This aligns with long-standing evidence that anatomical/physiological differences yield distinct habitual $F_0$ regimes across speakers\[[40](https://arxiv.org/html/2606.31055#bib.bib29),[38](https://arxiv.org/html/2606.31055#bib.bib34)\].

Beyond this baseline conditioning, Figures [2](https://arxiv.org/html/2606.31055#S3.F2) and [3](https://arxiv.org/html/2606.31055#S3.F3) show a consistent scaling of prosodic expressivity with interactional state. In particular, $F_0$ SD and $F_0$ range increase monotonically across arousal and dominance sextiles obtained with Vox-Profile. This matches prior work showing that natural speech tends to use a wider $F_0$ bandwidth in higher-activation emotional or interactional states, with larger pitch excursions when speakers are more activated or more socially assertive\[[4](https://arxiv.org/html/2606.31055#bib.bib41),[3](https://arxiv.org/html/2606.31055#bib.bib53),[11](https://arxiv.org/html/2606.31055#bib.bib46)\]. If the $F_0$ produced by a speaker or an S2S dialogue system remains confined to a narrow band in high-arousal or high-dominance contexts, it may sound constrained or unnatural even when the mean $F_0$ is within a typical range\[[23](https://arxiv.org/html/2606.31055#bib.bib47)\]. Prosodic evaluation should therefore include expressivity-sensitive checks in addition to $F_0$ level, and interpret deviations relative to reference distributions stratified by arousal and dominance\[[29](https://arxiv.org/html/2606.31055#bib.bib32),[45](https://arxiv.org/html/2606.31055#bib.bib31)\]. In held-out rows, a pooled $F_0$-SD reference flags 21.11% of low-arousal and 16.07% of high-arousal samples, whereas arousal-matched references reduce these rates to 10.16% and 9.54%, respectively.

### III-B Temporal Regimes

![Refer to caption](https://arxiv.org/html/2606.31055v1/x4.png)
Figure 4: Conversational rhythm varies with interactional state. Smoothed trends show speech rate (left) and pause ratio (right) as functions of arousal and dominance (0–1). Trends are computed from equal-count bins ($N = 91{,}471$) over the state axis, followed by smoothing. Shaded bands indicate $\approx 95\%$ uncertainty (SEM-based). Bins with sparse data are omitted to reduce unstable end effects.

![Refer to caption](https://arxiv.org/html/2606.31055v1/x5.png)
Figure 5: Model-predicted age-bin effects on prosody and rhythm. Left: mean $F_0$ (10–90% trimmed) across Vox-Profile age bins (18–29 / 30–59 / 60+) stratified by model-predicted sex label ($N = 118{,}033$). Right: speech rate and pause ratio distributions across the same model-predicted age bins ($N = 88{,}287$).

We characterize conversational rhythm with speech rate and pause ratio, capturing how time is allocated between verbal output and silence. Table [II](https://arxiv.org/html/2606.31055#S3) shows that model-predicted sex-label effects on temporal metrics are negligible for speech rate and mean pause duration, and small for articulation rate and pause ratio. Figure [4](https://arxiv.org/html/2606.31055#S3.F4) shows stronger state dependence, where speech rate increases with arousal and dominance, while pause ratio decreases. This coupled pattern is consistent with classic observations that perceived “speed of talking” is driven heavily by the structure and frequency of pauses rather than articulation speed alone, motivating joint descriptions of conversational rhythm\[[13](https://arxiv.org/html/2606.31055#bib.bib15),[14](https://arxiv.org/html/2606.31055#bib.bib52)\]. It also agrees with applied-linguistics evidence that conversational speech rates occupy a bounded range while varying by situation and interaction\[[36](https://arxiv.org/html/2606.31055#bib.bib13),[27](https://arxiv.org/html/2606.31055#bib.bib51)\]. For high-arousal rows, pooled references flag 15.76% of speech-rate samples and 13.18% of pause-ratio samples, while arousal-matched references reduce these rates to 12.50% and 10.84%, respectively (Table [V](https://arxiv.org/html/2606.31055#S3.T5)).

Model\-predicted age bin further shifts temporal regimes in a way that matters operationally for evaluation\. Figure[5](https://arxiv.org/html/2606.31055#S3.F5)shows that model\-predicted age bins are associated with changes in speech rate and pause ratio, implying that timing targets should not be treated as universal across this operational stratifier\. This result complements prior conversation analyses showing that timing patterns can index participation style and interactional dynamics\[[5](https://arxiv.org/html/2606.31055#bib.bib48)\]\. For conversational systems, the practical implication is that timing\-based evaluation should assess pace and pausing jointly, and condition on model\-predicted age bin when that stratifier is available\.

## IV. Reference-Based Evaluation Protocol

The reference regimes in Section[III](https://arxiv.org/html/2606.31055#S3)are intended to support a lightweight evaluation procedure for S2S dialogue outputs\. Given an output waveform from a spoken dialogue system, the protocol is:

1. Extract the same prosodic and temporal metrics used in this study: $F_0$ mean, $F_0$ SD, $F_0$ range, speech rate, articulation rate, pause ratio, and mean pause duration.
2. Select a reference stratum using the available conditioning variables, such as model-predicted sex label for absolute $F_0$, arousal or dominance bins for $F_0$ expressivity and rhythm, and model-predicted age bin when timing comparisons require it. If a conditioning variable is unavailable, use the coarsest applicable reference and report that comparison scope explicitly.
3. Convert each system metric $m$ to a percentile $p_m$ under the selected human reference distribution.
4. Flag metrics below the 5th percentile or above the 95th percentile as out-of-regime, and report the output as a vector of percentile deviations rather than a single opaque score.

This report identifies which speech\-native dimensions are atypical, the direction of the deviation, and the matched conversational regime under which the deviation was measured\. The protocol is therefore a behavioral plausibility check: it can flag prosodic compression, unusually fast or slow pacing, or atypical pause allocation, but it does not claim to replace human judgments of naturalness or interaction quality\.
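
Steps 3 and 4 of the protocol can be sketched in Python as follows, assuming steps 1 and 2 have already produced a dictionary of system metrics and a chosen stratum. The reference-table layout, the metric and stratum names, the mid-rank percentile convention, and the reading of "percentile deviation" as the signed distance from the reference median are all illustrative assumptions, not details given in the paper.

```python
from bisect import bisect_left, bisect_right


def percentile_rank(reference, value):
    """Mid-rank percentile (0-100) of `value` within a reference sample."""
    ref = sorted(reference)
    below = bisect_left(ref, value)
    ties = bisect_right(ref, value) - below
    return 100.0 * (below + 0.5 * ties) / len(ref)


def evaluate_output(system_metrics, reference_tables, stratum,
                    low=5.0, high=95.0):
    """Return, per metric, the percentile, its signed deviation, and a flag.

    `reference_tables[stratum][metric]` is a list of human reference values
    (hypothetical layout).
    """
    report = {}
    for metric, value in system_metrics.items():
        p = percentile_rank(reference_tables[stratum][metric], value)
        if p < low:
            flag = "below"
        elif p > high:
            flag = "above"
        else:
            flag = "in_regime"
        report[metric] = {
            "percentile": p,
            "deviation": p - 50.0,  # signed distance from the reference median
            "flag": flag,
        }
    return report
```

Returning one entry per metric, rather than a single aggregate, keeps the direction of each deviation visible: for instance, an $F_0$ SD flagged "below" in a high-arousal stratum would point to prosodic compression.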

## V. Limitations

This framework provides behavioral plausibility checks rather than a validated perceptual naturalness model\. The reference regimes and conditioning effects do not yet map deviations to perceptual thresholds or user\-rated interaction quality\. Results are derived from a single English dyadic corpus\. Operating regions may shift with language, domain, recording conditions, and interaction setting\. Model\-predicted age bin, arousal, dominance, and sex label are derived from Vox\-Profile rather than ground\-truth speaker\-trait metadata, so observed stratification effects should be interpreted as model\-conditioned annotations and validated against ground\-truth metadata where available\. The analysis uses a binary sex label predicted from voice and hence does not model non\-binary or self\-identified identity categories\. Future work should validate these regimes against human judgments and system outputs across multiple datasets and languages\.


## References

- \[1\] V. Agrawal, A. Akinyemi, K. Alvero, M. Behrooz, J. Buffalini, F. M. Carlucci, J. Chen, J. Chen, Z. Chen, S. Cheng, P. Chowdary, J. Chuang, A. D’Avirro, J. Daly, N. Dong, M. Duppenthaler, C. Gao, J. Girard, M. Gleize, S. Gomez, H. Gong, S. Govindarajan, B. Han, S. He, D. Hernandez, Y. Hristov, R. Huang, H. Inaguma, S. Jain, R. Janardhan, Q. Jia, C. Klaiber, D. Kovachev, M. Kumar, H. Li, Y. Li, P. Litvin, W. Liu, G. Ma, J. Ma, M. Ma, X. Ma, L. Mantovani, S. Miglani, S. Mohan, L. Morency, E. Ng, K. Ng, T. A. Nguyen, A. Oberai, B. Peloquin, J. Pino, J. Popovic, O. Poursaeed, F. Prada, A. Rakotoarison, A. Richard, C. Ropers, S. Saleem, V. Sharma, A. Shcherbyna, J. Shen, J. Shen, A. Stathopoulos, A. Sun, P. Tomasello, T. Tran, A. Turkatenko, B. Wan, C. Wang, J. Wang, M. Williamson, C. Wood, T. Xiang, Y. Yang, Z. Yao, C. Zhang, J. Zhang, X. Zhang, J. Zheng, P. Zhyzheria, J. Zikes, and M. Zollhoefer (2025). Seamless interaction: dyadic audiovisual motion modeling and large-scale dataset. External Links: [Link](https://ai.meta.com/research/publications/seamless-interaction-dyadic-audiovisual-motion-modeling-and-large-scale-dataset/)
- \[2\]P\. Arantes, A\. Eriksson, and V\. G\. Lima\(2018\)Minimum sample length for the estimation of long\-term speaking rate\.InProc\. 9th International Conference on Speech Prosody,Vol\.2018,pp\. 661–665\.Cited by:[§II\-C](https://arxiv.org/html/2606.31055#S2.SS3.p1.9)\.
- \[3\]R\. Banse and K\. R\. Scherer\(1996\)Acoustic profiles in vocal emotion expression\.\.Journal of personality and social psychology70\(3\),pp\. 614\.Cited by:[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
- \[4\]T\. Bänziger and K\. R\. Scherer\(2005\-07\)The role of intonation in emotional expressions\.Speech Communication46\(3\),pp\. 252–267\.External Links:ISSN 0167\-6393,[Document](https://dx.doi.org/10.1016/j.specom.2005.02.016)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
- \[5\]N\. Campbell\(2008\)Individual traits of speaking style and speech rhythm in a spoken discourse\.InVerbal and Nonverbal Features of Human\-Human and Human\-Machine Interaction: COST Action 2102 International Conference, Patras, Greece, October 29\-31, 2007\. Revised Papers,pp\. 107–120\.Cited by:[§III\-B](https://arxiv.org/html/2606.31055#S3.SS2.p2.1)\.
- \[6\]E\. Campione, J\. Véronis,et al\.\(2002\)A large\-scale multilingual study of silent pause duration\.InSpeech prosody,Vol\.2002,pp\. 199–202\.Cited by:[§II\-C](https://arxiv.org/html/2606.31055#S2.SS3.p1.9)\.
- \[7\]N\. Cliff\(1993\)Dominance statistics: ordinal analyses to answer ordinal questions\.Psychological Bulletin114\(3\),pp\. 494\.Cited by:[§III](https://arxiv.org/html/2606.31055#S3.63.63.42)\.
- \[8\]S\. Dowding, C\. Gutwin, and A\. Cockburn\(2024\-04\)User speech rates and preferences for system speech rates\.International Journal of Human\-Computer Studies184,pp\. 103222\(en\)\.External Links:ISSN 10715819,[Link](https://linkinghub.elsevier.com/retrieve/pii/S1071581924000065),[Document](https://dx.doi.org/10.1016/j.ijhcs.2024.103222)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1)\.
- \[9\]T\. Feng, J\. Lee, A\. Xu, Y\. Lee, T\. Lertpetchpun, X\. Shi, H\. Wang, T\. Thebaud, L\. Moro\-Velazquez, D\. Byrd,et al\.\(2025\)Vox\-profile: a speech foundation model benchmark for characterizing diverse speaker and speech traits\.arXiv preprint arXiv:2505\.14648\.Cited by:[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p1.1)\.
- \[10\]R\. Fusaroli, J\. Raczaszek\-Leonardi, and K\. Tylen\(2014\-01\)Dialog as interpersonal synergy\.New Ideas in Psychology32,pp\. 147–157\.External Links:ISSN 0732\-118X,[Document](https://dx.doi.org/10.1016/j.newideapsych.2013.03.005)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1),[§II\-A](https://arxiv.org/html/2606.31055#S2.SS1.p1.1)\.
- \[11\]P\. Geng, W\. Gu, K\. Johnson, and D\. Erickson\(2020\)Acoustic\-prosodic and articulatory characteristics of the mandarin speech conveying dominance or submissiveness\.InProc\. 10th International Conference on Speech Prosody,pp\. 424–428\.Cited by:[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
- \[12\]F\. Goldman\-Eisler\(1956\)The determinants of the rate of speech output and their mutual relations\.Journal of Psychosomatic Research\.Cited by:[§II\-C](https://arxiv.org/html/2606.31055#S2.SS3.p1.9)\.
- \[13\]F\. Goldman\-Eisler\(1958\)Speech analysis and mental processes\.Language and speech1\(1\),pp\. 59–75\.Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-B](https://arxiv.org/html/2606.31055#S3.SS2.p1.1)\.
- \[14\]F\. Goldman\-Eisler\(1961\)The significance of changes in the rate of articulation\.Language and Speech4\(3\),pp\. 171–174\.Cited by:[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-B](https://arxiv.org/html/2606.31055#S3.SS2.p1.1)\.
- \[15\]V\. Hazan and R\. Baker\(2010\)Does reading clearly produce the same acoustic\-phonetic modifications as spontaneous speech in a clear speaking style?\.InProc\. DiSS 2010,pp\. 7–10\.Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1)\.
- \[16\]P\. Howell and K\. Kadi\-Hanifi\(1991\)Comparison of prosodic properties between read and spontaneous speech material\.Speech communication10\(2\),pp\. 163–169\.Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1)\.
- \[17\]J\. Iwarsson, J\. Naes, and R\. Hollen\(2023\)Measuring speaking rate: how do objective measurements correlate with audio\-perceptual ratings?\.Logopedics Phoniatrics Vocology48\(2\),pp\. 57–66\.Cited by:[§II\-C](https://arxiv.org/html/2606.31055#S2.SS3.p1.9)\.
- \[18\]Y\. Jadoul, B\. Thompson, and B\. De Boer\(2018\)Introducing parselmouth: a python interface to praat\.Journal of Phonetics71,pp\. 1–15\.Cited by:[§II\-B](https://arxiv.org/html/2606.31055#S2.SS2.p1.8)\.
- \[19\]Y\. Jiao, V\. Berisha, M\. Tu, T\. Huston, and J\. Liss\(2015\-11\)Estimating speaking rate in spontaneous discourse\.In2015 49th Asilomar Conference on Signals, Systems and Computers,Pacific Grove, CA, USA,pp\. 1189–1192\(en\)\.External Links:ISBN 978\-1\-4673\-8576\-3,[Link](http://ieeexplore.ieee.org/document/7421328/),[Document](https://dx.doi.org/10.1109/ACSSC.2015.7421328)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[20\]H\. Koiso, Y\. Horiuchi, S\. Tutiya, A\. Ichikawa, and Y\. Den\(1998\-07\)An Analysis of Turn\-Taking and Backchannels Based on Prosodic and Syntactic Features in Japanese Map Task Dialogs\.Language and Speech41\(3\-4\),pp\. 295–321\.External Links:ISSN 0023\-8309,[Link](https://doi.org/10.1177/002383099804100404),[Document](https://dx.doi.org/10.1177/002383099804100404)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p1.1)\.
- \[21\]W\. H\. Kruskal and W\. A\. Wallis\(1952\)Use of ranks in one\-criterion variance analysis\.Journal of the American statistical Association47\(260\),pp\. 583–621\.Cited by:[§III](https://arxiv.org/html/2606.31055#S3.63.63.42)\.
- \[22\]S\. C\. Levinson\(2016\-01\)Turn\-taking in Human Communication – Origins and Implications for Language Processing\.Trends in Cognitive Sciences20\(1\),pp\. 6–14\(en\)\.External Links:ISSN 13646613,[Link](https://linkinghub.elsevier.com/retrieve/pii/S1364661315002764),[Document](https://dx.doi.org/10.1016/j.tics.2015.10.010)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p1.1)\.
- \[23\]X\. Liu, Y\. Xu, W\. Zhang, and X\. Tian\(2021\)Multiple prosodic meanings are conveyed through separate pitch ranges: evidence from perception of focus and surprise in mandarin chinese\.Cognitive, Affective, & Behavioral Neuroscience21\(6\),pp\. 1164–1175\.Cited by:[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
- \[24\]N\. Morgan and E\. Fosler\-Lussier\(1998\)Combining multiple estimators of speaking rate\.InProceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP ’98 \(Cat\. No\.98CH36181\),Vol\.2,Seattle, WA, USA,pp\. 729–732\(en\)\.External Links:ISBN 978\-0\-7803\-4428\-0,[Link](http://ieeexplore.ieee.org/document/675368/),[Document](https://dx.doi.org/10.1109/ICASSP.1998.675368)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[25\]H\. Nanjo and T\. Kawahara\(2004\-07\)Language Model and Speaking Rate Adaptation for Spontaneous Presentation Speech Recognition\.IEEE Transactions on Speech and Audio Processing12\(4\),pp\. 391–400\(en\)\.External Links:ISSN 1063\-6676,[Link](http://ieeexplore.ieee.org/document/1306512/),[Document](https://dx.doi.org/10.1109/TSA.2004.828641)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[26\]T\. Nguyen, C\. Van Nguyen, V\. D\. Lai, H\. Man, N\. T\. Ngo, F\. Dernoncourt, R\. A\. Rossi, and T\. H\. Nguyen\(2024\)Culturax: a cleaned, enormous, and multilingual dataset for large language models in 167 languages\.InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation \(LREC\-COLING 2024\),pp\. 4226–4237\.Cited by:[§II\-B](https://arxiv.org/html/2606.31055#S2.SS2.p1.8)\.
- \[27\]H\. Nishizawa\(2024\)Authenticity of academic lecture passages in high\-stakes tests: a temporal fluency perspective\.Language Testing41\(4\),pp\. 792–816\.Cited by:[§III\-B](https://arxiv.org/html/2606.31055#S3.SS2.p1.1)\.
- \[28\]J\. O’Mahony, C\. Lai, and S\. King\(2022\-09\)Combining conversational speech with read speech to improve prosody in Text\-to\-Speech synthesis\.InInterspeech 2022,pp\. 3388–3392\(en\)\.External Links:[Document](https://dx.doi.org/10.21437/Interspeech.2022-10167)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p1.1)\.
- \[29\]M\. D\. Pell, A\. Jaywant, L\. Monetta, and S\. A\. Kotz\(2011\-08\)Emotional speech processing: Disentangling the effects of prosody and semantic cues\.Cognition & Emotion25\(5\),pp\. 834–853\(en\)\.External Links:ISSN 0269\-9931, 1464\-0600,[Link](http://www.tandfonline.com/doi/abs/10.1080/02699931.2010.516915),[Document](https://dx.doi.org/10.1080/02699931.2010.516915)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
- \[30\]F\. Pellegrino, C\. Coupe, and E\. Marsico\(2011\-09\)Across\-Language Perspective on Speech Information Rate\.Language87\(3\),pp\. 539–558\(en\)\.External Links:ISSN 1535\-0665,[Link](https://muse.jhu.edu/article/449938),[Document](https://dx.doi.org/10.1353/lan.2011.0057)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[31\]F\. Pellegrino, J\. Farinas, and J\. L\. Rouas\(2004\-03\)Automatic estimation of speaking rate in multilingual spontaneous speech\.InSpeech Prosody 2004,pp\. 517–520\(en\)\.External Links:[Document](https://dx.doi.org/10.21437/SpeechProsody.2004-119)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[32\]C\. Riest, A\. B\. Jorschick, and J\. P\. de Ruiter\(2015\-02\)Anticipation in turn\-taking: mechanisms and information sources\.Frontiers in Psychology6\(English\)\.External Links:ISSN 1664\-1078,[Document](https://dx.doi.org/10.3389/fpsyg.2015.00089)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p1.1)\.
- \[33\]A\.M\.C\. Sluijter and V\.J\. Van Heuven\(1996\)Acoustic correlates of linguistic stress and accent in Dutch and American English\.InProceeding of Fourth International Conference on Spoken Language Processing\. ICSLP ’96,Vol\.2,Philadelphia, PA, USA,pp\. 630–633\(en\)\.External Links:ISBN 978\-0\-7803\-3555\-4,[Link](http://ieeexplore.ieee.org/document/607440/),[Document](https://dx.doi.org/10.1109/ICSLP.1996.607440)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[34\]A\. M\. C\. Sluijter and V\. J\. Van Heuven\(1996\-10\)Spectral balance as an acoustic correlate of linguistic stress\.The Journal of the Acoustical Society of America100\(4\),pp\. 2471–2485\(en\)\.External Links:ISSN 0001\-4966, 1520\-8524,[Link](https://pubs.aip.org/jasa/article/100/4/2471/580350/Spectral-balance-as-an-acoustic-correlate-of),[Document](https://dx.doi.org/10.1121/1.417955)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1)\.
- \[35\]C\. Spearman\(1961\)The proof and measurement of association between two things\.American Journal of Psychology\.Cited by:[§III](https://arxiv.org/html/2606.31055#S3.63.63.42)\.
- \[36\]S\. Tauroza and D\. Allison\(1990\-03\)Speech Rates in British English\.Applied Linguistics11\(1\),pp\. 90–105\(en\)\.External Links:ISSN 0142\-6001, 1477\-450X,[Link](https://academic.oup.com/applij/article-lookup/doi/10.1093/applin/11.1.90),[Document](https://dx.doi.org/10.1093/applin/11.1.90)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-B](https://arxiv.org/html/2606.31055#S3.SS2.p1.1)\.
- \[37\]Silero Team\(2024\)Silero VAD: pre\-trained enterprise\-grade voice activity detector \(VAD\), number detector and language classifier\.GitHub\.External Links:[Link](https://github.com/snakers4/silero-vad)Cited by:[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p1.1)\.
- \[38\]I\. R\. Titze\(1994\-03\)Toward standards in acoustic analysis of voice\.Journal of Voice8\(1\),pp\. 1–7\(en\)\.External Links:ISSN 08921997,[Link](https://linkinghub.elsevier.com/retrieve/pii/S0892199705803133),[Document](https://dx.doi.org/10.1016/S0892-1997%2805%2980313-3)Cited by:[§II\-B](https://arxiv.org/html/2606.31055#S2.SS2.p1.8),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p1.6)\.
- \[39\]M\. Tomczak and E\. Tomczak\(2014\)The need to report effect size estimates revisited\. an overview of some recommended measures of effect size\.Trends in Sport Sciences\.Cited by:[§III](https://arxiv.org/html/2606.31055#S3.63.63.42)\.
- \[40\]H\. Traunmüller and A\. Eriksson\(2000\-06\)Acoustic effects of variation in vocal effort by men, women, and children\.The Journal of the Acoustical Society of America107\(6\),pp\. 3438–3451\(en\)\.External Links:ISSN 0001\-4966, 1520\-8524,[Link](https://pubs.aip.org/jasa/article/107/6/3438/554986/Acoustic-effects-of-variation-in-vocal-effort-by),[Document](https://dx.doi.org/10.1121/1.429414)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p5.3),[§II\-D](https://arxiv.org/html/2606.31055#S2.SS4.p2.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p1.6)\.
- \[41\]A\. P\. Vogel, P\. Maruff, P\. J\. Snyder, and J\. C\. Mundt\(2009\)Standardization of pitch\-range settings in voice acoustic analysis\.Behavior research methods41\(2\),pp\. 318–324\.Cited by:[§II\-B](https://arxiv.org/html/2606.31055#S2.SS2.p1.8)\.
- \[42\]N\. Ward and S\. Nakagawa\(2004\-10\)Automatic User\-Adaptive Speaking Rate Selection\.International Journal of Speech Technology7\(4\),pp\. 259–268\(en\)\.External Links:ISSN 1381\-2416, 1572\-8110,[Link](https://link.springer.com/10.1023/B:IJST.0000037070.31146.f9),[Document](https://dx.doi.org/10.1023/B%3AIJST.0000037070.31146.f9)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1)\.
- \[43\]C\. J\. Wynn, T\. S\. Barrett, and S\. A\. Borrie\(2024\-05\)Conversational Speech Behaviors Are Context Dependent\.Journal of Speech, Language, and Hearing Research67\(5\),pp\. 1360–1369\(en\)\.External Links:ISSN 1092\-4388, 1558\-9102,[Document](https://dx.doi.org/10.1044/2024%5FJSLHR-23-00622)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1),[§I](https://arxiv.org/html/2606.31055#S1.p3.1),[§I](https://arxiv.org/html/2606.31055#S1.p5.3),[§II\-A](https://arxiv.org/html/2606.31055#S2.SS1.p1.1)\.
- \[44\]Y\. Xie, J\. Qu, Y\. Zhang, R\. Zhou, and A\. H\. S\. Chan\(2024\-11\)Speaking, fast or slow: how conversational agents’ rate of speech influences user experience\.Universal Access in the Information Society23\(4\),pp\. 1947–1956\(en\)\.External Links:ISSN 1615\-5289, 1615\-5297,[Link](https://link.springer.com/10.1007/s10209-023-01000-2),[Document](https://dx.doi.org/10.1007/s10209-023-01000-2)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p2.1)\.
- \[45\]Y\. Xu\(2005\-07\)Speech melody as articulatorily implemented communicative functions\.Speech Communication46\(3\-4\),pp\. 220–251\(en\)\.External Links:ISSN 01676393,[Link](https://linkinghub.elsevier.com/retrieve/pii/S0167639305000889),[Document](https://dx.doi.org/10.1016/j.specom.2005.02.014)Cited by:[§I](https://arxiv.org/html/2606.31055#S1.p3.1),[§III\-A](https://arxiv.org/html/2606.31055#S3.SS1.p2.7)\.
