How to Leverage Synthetic Speech for LLM-Based ASR Systems?
Summary
This paper investigates the distributional gap between synthetic and real speech in LLM-based ASR systems, identifies where the LLM separates them, and proposes using layer-selection and RIR augmentation to match real-data baselines with less real data.
# How to Leverage Synthetic Speech for LLM-Based ASR Systems?
Source: [https://arxiv.org/html/2606.29031](https://arxiv.org/html/2606.29031)
Yanis Labrak1, Dairazalia Sanchez\-Cortes1, Sergio Burdisso1, Séverin Baroudi2, Shashi Kumar1,3, Esaú Villatoro\-Tello1, Srikanth Madikeri4, Manjunath K E5, Oldřich Plchot6, Kadri Hacioğlu5, Petr Motlicek1,6, Andreas Stolcke5
###### Abstract
In regulated domains such as banking and healthcare, where privacy constraints make real speech costly to collect and retain, synthetic speech from modern text\-to\-speech \(TTS\) is an appealing alternative for training automatic speech recognition \(ASR\) without exposing sensitive customer recordings\. Yet a persistent distributional gap between synthetic and real data limits how far it can replace genuine recordings\. Prior work largely treats this gap as a black box to be engineered around; we instead examine its origin directly by probing a SLAM\-ASR architecture\. We localise where its LLM backbone separates real from synthetic speech and find the discriminative signal concentrated in the early\-to\-middle layers, where temporal and prosodic perturbations disrupt it most\. We further show that representation\-level separability helps guide the choice of perturbations but does not directly predict downstream ASR gains\. In contrast, convolving synthetic audio with room impulse responses \(RIRs\) narrows the gap not by making synthetic speech sound cleaner or more natural, but by reproducing the acoustic irregularities of real recordings\. Translating these findings into the training procedure, a layer\-selection module combined with RIR augmentation matches a fully real\-data baseline using only 25% of the real speech \(13\.6 h\) and surpasses it at all higher proportions\.
## I Introduction
Speech data in regulated domains such as banking and healthcare is among the hardest to collect and retain at scale\. Customer–agent telephone calls contain personally identifiable and financial information, and data\-protection regimes such as the GDPR\[[1](https://arxiv.org/html/2606.29031#bib.bib1)\] constrain how such recordings may be stored, shared, and reused for model training, since voice recordings potentially qualify as biometric data\. Synthetic speech from modern text\-to\-speech \(TTS\) offers an appealing way around these constraints: utterances can be generated on demand without exposing real customers, sidestepping the collection and retention of sensitive audio\. This could cut both the cost and the privacy burden of annotating real speech for ASR\. However, the promise only holds if synthetic speech can actually substitute for real recordings, which in turn requires closing the distributional gap that still separates the two\. Most modern LLM\-based ASR models still separate real from synthetic signals with high accuracy\[[2](https://arxiv.org/html/2606.29031#bib.bib2)\], limiting how far synthetic audio can replace real recordings in low\-resource settings\. Rather than treating this gap as a black box, we examine where it arises inside the model and use that to guide practical training recipes\. We organise the study around four research questions:
1. Where is real/synthetic discrimination encoded? Which layers of the LLM backbone most strongly separate the two classes and can be leveraged using layer\-wise weighted pooling \(Figure[1](https://arxiv.org/html/2606.29031#S3.F1)\)? And which signal\-level perturbations disrupt this separation most?
2. Do interpretability\-guided filters improve ASR? Does disrupting the real/synthetic boundary at the representation level translate to downstream WER gains?
3. How much real data can synthetic replace? What real/synthetic mix preserves or improves over an all\-real baseline?
4. How much synthetic data helps on top of full real data? Does raw or RIR\-augmented synthetic audio help when added to the full real corpus?
Taken together, our findings provide some of the first answers to these questions\. Probing the LoRA\-adapted LLM backbone with overlap metrics \(Section[IV\-A](https://arxiv.org/html/2606.29031#S4.SS1)\), we localise where the synthetic/real gap is concentrated and which signal\-level perturbations affect it most \(Section[IV](https://arxiv.org/html/2606.29031#S4)\)\. We then show that this representation\-level separability guides filter selection but does not necessarily predict downstream gains \(Table[VI](https://arxiv.org/html/2606.29031#S5.T6)\), and that room impulse response \(RIR\) convolution narrows the gap not by improving perceptual quality but by reproducing the acoustic irregularities of real recordings \(Section[V](https://arxiv.org/html/2606.29031#S5)\)\. Guided by these findings, a per\-token layer\-wise weighted pooling over the decoder layers \(Figure[1](https://arxiv.org/html/2606.29031#S3.F1)\), combined with RIR augmentation, matches a fully real\-data baseline using only 25% of the real speech and surpasses it at all higher proportions \(Section[V\-D](https://arxiv.org/html/2606.29031#S5.SS4)\)\.
## II Background and Related Work
### II\-A Speech LLMs and LLM\-based ASR
Coupling a speech encoder to a pre\-trained large language model \(LLM\) through a trainable projector has become a dominant recipe for general\-purpose audio understanding; we refer to this family as SpeechLLM\. Systems such as SALMONN\[[3](https://arxiv.org/html/2606.29031#bib.bib3)\], Qwen\-Audio\[[4](https://arxiv.org/html/2606.29031#bib.bib4)\], and AudioPaLM\[[5](https://arxiv.org/html/2606.29031#bib.bib5)\] connect self\-supervised or supervised encoders to a frozen or lightly adapted text LLM, reaching strong ASR performance alongside a broader range of audio tasks\. SLAM\-ASR\[[6](https://arxiv.org/html/2606.29031#bib.bib6)\] shows that a single trainable projector between a frozen WavLM\[[7](https://arxiv.org/html/2606.29031#bib.bib7)\] encoder and a frozen LLM suffices for competitive performance, in contrast to fully supervised encoder\-decoder models such as Whisper\[[8](https://arxiv.org/html/2606.29031#bib.bib8)\]\. Work on these models has concentrated on linguistic performance, whereas how the LLM backbone encodes non\-linguistic properties remains largely unexamined\[[9](https://arxiv.org/html/2606.29031#bib.bib9)\]\.
### II\-B Training ASR with Synthetic Speech and Its Limits
TTS\-generated speech is widely used to augment low\-resource ASR\[[10](https://arxiv.org/html/2606.29031#bib.bib10),[11](https://arxiv.org/html/2606.29031#bib.bib11),[12](https://arxiv.org/html/2606.29031#bib.bib12)\], and modern multi\-speaker TTS has made this a practical foundation for recent work\[[13](https://arxiv.org/html/2606.29031#bib.bib13),[14](https://arxiv.org/html/2606.29031#bib.bib14)\]\. The benefits, however, are uneven, and two obstacles remain insufficiently examined\. First, reported performance often depends on the amount of data generated\[[13](https://arxiv.org/html/2606.29031#bib.bib13)\] rather than on the diversity of its voices or acoustic conditions\. Second, a systematic acoustic mismatch separates synthetic from real speech: open\-source TTS outputs are cleaner and more uniform than real recordings\[[15](https://arxiv.org/html/2606.29031#bib.bib15)\], so a model trained on them learns cues specific to the TTS model and transfers poorly to real audio\[[11](https://arxiv.org/html/2606.29031#bib.bib11),[16](https://arxiv.org/html/2606.29031#bib.bib16)\]\. Recent methods reduce this mismatch in several ways: by merging the weights of models trained separately on real and synthetic data\[[16](https://arxiv.org/html/2606.29031#bib.bib16)\], by filtering and optimising the generation pipeline\[[14](https://arxiv.org/html/2606.29031#bib.bib14)\], or by letting ASR and TTS models refine one another in a closed loop\[[17](https://arxiv.org/html/2606.29031#bib.bib17)\]\. All of these act on the data, the TTS model, or the final weights of the ASR model, and all treat the model’s training procedure as a black box\. None of this previous work asks where inside the model the synthetic/real distinction is actually represented\.
### II\-C Interpretability of Speech Models
Modern neural vocoders and diffusion models have pushed TTS perceptual quality toward human level, yet residual cues in pitch, prosody, and spectral envelope persist\[[18](https://arxiv.org/html/2606.29031#bib.bib18),[19](https://arxiv.org/html/2606.29031#bib.bib19)\]\. These cues constitute a distributional gap between synthetic and real speech that perceptual fidelity alone does not close\. The anti\-spoofing community studies precisely this gap: the ASVspoof initiative\[[20](https://arxiv.org/html/2606.29031#bib.bib20)\] and subsequent detectors\[[21](https://arxiv.org/html/2606.29031#bib.bib21)\] show that synthetic speech is reliably detectable because it carries systematic, learnable signatures\. The detectability that anti\-spoofing exploits and the mismatch that limits augmentation are thus two views of the same discrepancy\. What neither line of work asks is where, inside a speech model, that gap is encoded\. The closest evidence comes from interpretability work, which has surfaced cues about how speech models represent information yet has centered on the encoder\[[22](https://arxiv.org/html/2606.29031#bib.bib22),[23](https://arxiv.org/html/2606.29031#bib.bib23),[24](https://arxiv.org/html/2606.29031#bib.bib24)\]: layer\-wise probing reveals a progression from acoustic to semantic features across encoder layers\[[25](https://arxiv.org/html/2606.29031#bib.bib25)\], and aggregating those layers through a learnable weighted sum, as popularised by SUPERB\[[26](https://arxiv.org/html/2606.29031#bib.bib26)\], is now standard practice\. Attention has only recently turned to analysing the LLM backbone that consumes these features for understanding tasks, focusing in particular on the drop in performance between text and speech inputs\[[27](https://arxiv.org/html/2606.29031#bib.bib27)\] rather than on acoustic distinctions within speech itself\.
To our knowledge, no prior work examines how a speech LLM internally represents the synthetic/real gap, nor uses such an analysis to inform how synthetic data is exploited for training\.
## III Experimental Setup
### III\-A Models
We build our systems on top of the SLAM\-ASR framework\[[6](https://arxiv.org/html/2606.29031#bib.bib6)\], which couples a frozen WavLM\-Large\[[7](https://arxiv.org/html/2606.29031#bib.bib7)\] speech encoder to a Llama\-3\.2\-3B\-Instruct\[[28](https://arxiv.org/html/2606.29031#bib.bib28)\] backbone through a single\-hidden\-layer projector that downsamples the audio representations by a factor of 5\. Domain adaptation uses LoRA\[[29](https://arxiv.org/html/2606.29031#bib.bib29)\] adapters \($r=16$, $\alpha=32$, dropout 0\.05\) on the `q_proj` and `v_proj` modules of the LLM \(Figure[1](https://arxiv.org/html/2606.29031#S3.F1)\), with all other parameters frozen\. For the layer\-wise analysis, we probe the architecture at its base checkpoint \(projector trained, LLM frozen\), before any domain fine\-tuning\. Representations are read directly from the 28 Llama layers, so the analysis reflects generic acoustic encoding rather than domain\-specific adaptation\.
Figure 1: Layer\-wise weighted pooling inside the Llama architecture \(panels a and b\)\. The hidden states of all $L$ LLM layers are weighted by a trainable parameter that determines how much of each layer is kept, before the optional addition of the residual stream from the speech\. The pooled representation then passes through RMS Norm and `lm_head` to output the text tokens of the transcript\.
### III\-B Datasets
Corpus\. All experiments use DefinedAI\[[30](https://arxiv.org/html/2606.29031#bib.bib30),[31](https://arxiv.org/html/2606.29031#bib.bib31)\], a corpus of manually transcribed English customer–agent telephone calls\. The base model checkpoint is pre\-trained on a mixed set of ≃38 hours, giving a starting\-point WER of 10\.90% on the held\-out banking test set, which comprises 3,164 utterances \(6\.55 hours\)\. All downstream fine\-tuning and the layer\-wise analysis use exclusively the banking training partition \(26,457 utterances, 54\.43 hours of real speech\)\. Synthetic counterparts are generated with Qwen3\-TTS\[[32](https://arxiv.org/html/2606.29031#bib.bib32)\] for the same utterances, totalling ≃51 hours with RIRs\. The 100%\-real baseline reaches 8\.68% WER\.
Synthetic speech generation\. Synthetic utterances are produced with the Qwen3\-TTS VoiceDesign variant, which is conditioned on a natural\-language voice prompt rather than on reference speaker audio\. We selected Qwen3\-TTS after preliminary comparative listening among the authors against several open\-source alternatives\[[33](https://arxiv.org/html/2606.29031#bib.bib33),[34](https://arxiv.org/html/2606.29031#bib.bib34),[35](https://arxiv.org/html/2606.29031#bib.bib35),[36](https://arxiv.org/html/2606.29031#bib.bib36),[37](https://arxiv.org/html/2606.29031#bib.bib37),[38](https://arxiv.org/html/2606.29031#bib.bib38)\]: it gave the highest perceived naturalness, the fewest artefacts, natural prosody and, crucially, the most controllable voice characteristics, letting us match synthetic voices to real\-data persona metadata \(gender, race\) without any reference recordings\. A per\-role prompt is built from persona attributes and fixed style tokens \(*“clear articulation, naturally”*\), yielding a distinct synthetic voice per speaker role in every dialog\.
Room Impulse Response Augmentation\. Convolving clean speech with measured room impulse responses \(RIRs\) is an established robustness augmentation\[[39](https://arxiv.org/html/2606.29031#bib.bib39)\]\. For synthetic speech specifically, RIR convolution injects the room reverberation and channel variability absent from pristine TTS recordings, masking their clean\-condition signature and narrowing the synthetic/real gap\. We use the BUT Speech@FIT Reverb Database\[[40](https://arxiv.org/html/2606.29031#bib.bib40)\] for this purpose\.
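To make the RIR augmentation step concrete, the following is a minimal sketch of convolving a mono waveform with an impulse response. The function name `apply_rir`, the trimming of the RIR to its direct-path peak, and the energy and peak normalisation are illustrative assumptions, not details taken from the paper; loading audio and selecting RIRs from the BUT database are left out.

```python
import numpy as np
from scipy.signal import fftconvolve


def apply_rir(speech: np.ndarray, rir: np.ndarray) -> np.ndarray:
    """Convolve mono speech with a room impulse response (hypothetical helper).

    Assumptions: both arrays are 1-D and share the same sample rate.
    """
    # Start the RIR at its strongest tap (direct path) so no extra delay is added.
    rir = rir[np.argmax(np.abs(rir)):]
    # Normalise RIR energy so the convolution does not change the gain much.
    rir = rir / (np.sqrt(np.sum(rir ** 2)) + 1e-12)
    # Full convolution, truncated to the original duration.
    wet = fftconvolve(speech, rir, mode="full")[: len(speech)]
    # Rescale to the original peak level to avoid clipping on export.
    peak = np.max(np.abs(wet)) + 1e-12
    return wet * (np.max(np.abs(speech)) / peak)
```

With a single-tap impulse response the output equals the input, which is a convenient sanity check before applying measured RIRs.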
Figure 2: Within\-corpus speaker diversity \(pairwise cosine distance between `pyannote/embedding`\[[41](https://arxiv.org/html/2606.29031#bib.bib41)\] vectors; higher = more spread\)\.
Figure 3: Layer\-wise overlap metrics across all ablation conditions for 28 Llama layers\. Lower values indicate greater real/synthetic overlap\.
Speaker diversity\. Despite using no real speaker recordings, VoiceDesign yields a synthetic corpus with *higher* within\-corpus acoustic diversity than the real data\. Figure[2](https://arxiv.org/html/2606.29031#S3.F2) shows pairwise cosine distances between `pyannote/embedding`\[[41](https://arxiv.org/html/2606.29031#bib.bib41)\] speaker vectors over all training utterances: the synthetic corpus \(purple\) is shifted right of the real one \(green\), with a dominant mode of ≈0\.88 vs\. 0\.68, i\.e\. more spread across speaker space\. It is also bimodal: a secondary mode \(≈0\.6\) overlaps the real distribution, so the synthetic corpus keeps a real\-like core while adding more diverse vocal characteristics than the acoustically homogeneous real corpus, thanks to generating a distinct voice per speaker role in each dialogue\.
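The diversity measure above can be sketched as follows, assuming speaker embeddings have already been extracted (for example with `pyannote/embedding`) and stacked as rows of an array. The function name is hypothetical, and whether the paper uses all pairs or a sample of pairs is not stated; this sketch uses all unordered pairs.

```python
import numpy as np


def pairwise_cosine_distances(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance (1 - cosine similarity) for every unordered pair of rows."""
    # L2-normalise each embedding so the dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarity = unit @ unit.T
    # Keep the strict upper triangle: each pair once, no self-pairs.
    upper = np.triu_indices(len(embeddings), k=1)
    return 1.0 - similarity[upper]
```

A histogram of the returned distances per corpus gives the kind of distribution compared in Figure 2.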
TABLE I: Audio quality of corpora \(higher is better\)\. UTMOS = predicted MOS; PESQ = audio quality measured between the raw synthetic and the real audio\.
Audio quality\. Table[I](https://arxiv.org/html/2606.29031#S3.T1) reports UTMOS\[[42](https://arxiv.org/html/2606.29031#bib.bib42)\] and wideband PESQ for the three conditions\. Raw synthetic speech scores far higher than real speech on UTMOS \(4\.36 vs\. 2\.08\), confirming that Qwen3\-TTS produces perceptually cleaner output than real telephone recordings\.
RIR convolution sharply degrades naturalness \(UTMOS 1\.34\), with only a marginal PESQ change against the synthetic source \(1\.26 vs\. 1\.12\)\. The ASR gains come not from better naturalness but from making synthetic speech acoustically messier, like real telephone recordings, bridging the domain gap, not the perceptual one\.
### III\-C Audio Filters
We study six signal\-level perturbations applied to synthetic audio, summarised with their hyperparameters in Table[II](https://arxiv.org/html/2606.29031#S3.T2)\. For the downstream ASR experiments we additionally evaluate four combined conditions: High/Low Pass each paired with Pitch Shift or Time Stretch\.
TABLE II: Audio filter hyperparameters\.
We further apply real\-world RIR augmentation from the BUT Speech@FIT Reverb Database\[[40](https://arxiv.org/html/2606.29031#bib.bib40)\] to synthetic data, adding room acoustics as a stronger form of domain adaptation than the signal\-level perturbations alone\.
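As an illustration of the frequency-band perturbations, here is a minimal sketch of a zero-phase Butterworth filter built with SciPy. The cutoff frequencies and filter order actually used are given in Table II and are not reproduced here, so the values in the usage line are placeholders, and the helper name `butter_filter` is hypothetical.

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt


def butter_filter(audio, sample_rate, cutoff_hz, kind="highpass", order=4):
    """Apply a zero-phase Butterworth filter to a 1-D waveform.

    kind is "lowpass", "highpass" or "bandpass"; for "bandpass",
    cutoff_hz must be a (low, high) pair in Hz.
    """
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = butter(order, cutoff_hz, btype=kind, fs=sample_rate, output="sos")
    # Forward-backward filtering removes phase distortion.
    return sosfiltfilt(sos, audio)


# Placeholder usage: a 300 Hz high-pass at 16 kHz (values are illustrative only).
# filtered = butter_filter(waveform, 16000, 300.0, kind="highpass")
```

Pitch Shift and Time Stretch need a resampling or phase-vocoder routine and are not covered by this sketch.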
### III\-D Training Protocol
Training proceeds in two stages\. Stage 1 \(projector pre\-training\) trains only the speech encoder projector on the full mixed\-domain partition \(38\.17 hours\)\. The LLM backbone and WavLM encoder are kept frozen\. Stage 2 \(domain adaptation\) loads the Stage 1 projector and fine\-tunes LoRA adapters on the banking\-specific data for each experimental condition\. All components except the LoRA weights remain frozen\. In the layer\-wise weighted pooling experiments, the layer selector is additionally trained\. Both stages share the following hyperparameters: AdamW, learning rate $1\times10^{-4}$, linear warmup over 1,000 steps followed by linear decay to 0, batch size 10, gradient accumulation 1, BF16 precision, and random seed 42\. Each experiment is run individually on a single NVIDIA H100 GPU until reaching epoch 5\.
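The learning-rate schedule described above (linear warmup to the peak rate, then linear decay to 0) can be written as a small function. This is a sketch only: `total_steps` depends on the dataset size and epoch count and is an assumed input, and the function name is hypothetical.

```python
def learning_rate(step: int, total_steps: int,
                  peak_lr: float = 1e-4, warmup_steps: int = 1000) -> float:
    """Linear warmup to peak_lr over warmup_steps, then linear decay to 0."""
    if step < warmup_steps:
        # Warmup phase: scale linearly from 0 up to peak_lr.
        return peak_lr * step / warmup_steps
    # Decay phase: scale linearly from peak_lr down to 0 at total_steps.
    remaining = max(total_steps - step, 0)
    return peak_lr * remaining / max(total_steps - warmup_steps, 1)
```

For example, with a hypothetical `total_steps=10000`, the rate is half the peak at step 500 and again at step 5500.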
## IV Where is Real/Synthetic Discrimination Encoded?
Before evaluating downstream ASR performance, we probe the LLM backbone to localise *where* and *how strongly* the model separates real from synthetic speech, and to identify which audio perturbations are most effective at disrupting this separation\.
### IV\-A Overlap Metrics
We quantify the synthetic/real gap at each layer with four metrics \(Figure[3](https://arxiv.org/html/2606.29031#S3.F3), key layers in Table[III](https://arxiv.org/html/2606.29031#S4.T3)\)\. Two are silhouette scores, measuring class separation in the full representation space and after PCA projection to 2D \(lower values mean more overlap\)\. Wasserstein\-1 on PC1, normalised by the pooled PC1 standard deviation for scale\-invariance across layers, measures distributional distance along the first principal component \(lower = more overlap\)\. The PCA\-2D KDE overlap coefficient fits a kernel density estimate per class in the 2D PCA space and integrates $\min(p_{\text{real}},\,p_{\text{synth}})$ over it \(higher values indicate greater overlap and less separation\)\.
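Two of these metrics can be sketched with NumPy and SciPy as below. Several choices are assumptions rather than details from the paper: PCA is fitted on the pooled real and synthetic features, "pooled standard deviation" is taken as the standard deviation of the pooled PC1 scores, the KDE uses SciPy's default bandwidth, and the integral is approximated on a regular grid. The silhouette scores are omitted; they could be computed with scikit-learn's `silhouette_score`.

```python
import numpy as np
from scipy.stats import gaussian_kde, wasserstein_distance


def pca_project(features: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Project row vectors onto their top principal components via SVD."""
    centred = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centred, full_matrices=False)
    return centred @ vt[:n_components].T


def normalised_w1_on_pc1(real: np.ndarray, synth: np.ndarray) -> float:
    """Wasserstein-1 distance along PC1 divided by the pooled PC1 std."""
    pc1 = pca_project(np.vstack([real, synth]), n_components=1)[:, 0]
    distance = wasserstein_distance(pc1[: len(real)], pc1[len(real):])
    return float(distance / pc1.std())


def kde_overlap_2d(real: np.ndarray, synth: np.ndarray,
                   grid_size: int = 100) -> float:
    """Grid approximation of the integral of min(p_real, p_synth) in PCA-2D."""
    proj = pca_project(np.vstack([real, synth]), n_components=2)
    p_real = gaussian_kde(proj[: len(real)].T)
    p_synth = gaussian_kde(proj[len(real):].T)
    # Pad the bounding box so the density tails are included.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    pad = 0.25 * (hi - lo)
    xs = np.linspace(lo[0] - pad[0], hi[0] + pad[0], grid_size)
    ys = np.linspace(lo[1] - pad[1], hi[1] + pad[1], grid_size)
    gx, gy = np.meshgrid(xs, ys)
    points = np.vstack([gx.ravel(), gy.ravel()])
    cell_area = (xs[1] - xs[0]) * (ys[1] - ys[0])
    overlap = np.minimum(p_real(points), p_synth(points)).sum() * cell_area
    return float(overlap)
```

Here `real` and `synth` stand for per-utterance feature vectors taken from one LLM layer; how the paper pools token states into one vector per utterance is not specified in this section.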
### IV\-B Layer\-wise Overlap Findings
Table[III](https://arxiv.org/html/2606.29031#S4.T3)reports the PCA\-2D KDE overlap coefficient per filter at three representative Llama layers\.
TABLE III: PCA\-2D KDE overlap per filter at layers 3/14/28 \(higher = higher overlap; bold = best per layer\)\.
We highlight four observations: \(1\) Discrimination decreases monotonically with depth\. Silhouette scores converge near zero by layer 28 across all filters, and the overlap coefficients in Table[III](https://arxiv.org/html/2606.29031#S4.T3) reach 0\.75 to 0\.82 there, confirming that the final LLM layers collapse the synthetic/real gap regardless of perturbation\. The discriminative signal therefore concentrates in early\-to\-middle layers \(0 to 14\)\. \(2\) High Pass is the least effective filter\. Across all metrics and layers, High Pass maintains the highest silhouette scores, so spectral attenuation of low frequencies alone is insufficient to confuse the decoder\. \(3\) Time Stretch achieves the broadest early\-layer disruption\. It produces the lowest silhouette at layer ≈3, most effectively reducing discrimination in early representations; Pitch Shift and Band Pass achieve similar disruption at middle layers \(layer ≈14\)\. \(4\) Frequency\-band filters match prosodic ones\. Band Pass and Low Pass reduce discrimination nearly as well as Pitch Shift and Time Stretch in the mid\-layer range, suggesting that restricting frequency content targets the same cues as prosodic modification\. These observations guide filter selection for the downstream experiments: Time Stretch and Pitch Shift are prioritised individually, and their combination with High Pass tests whether spectral and prosodic perturbations are complementary at the ASR level even when High Pass alone fails to disrupt the representations\.
## V Downstream ASR Experiments
TABLE IV: Substitution: %WER on real test set\. Fixed budget varies the real/synth split at a constant total budget \(real \+ synth ≈ 100%\); Fixed synth keeps the full synthetic set \(100%\) and varies the real fraction\.
The layer\-wise analysis in Section[IV](https://arxiv.org/html/2606.29031#S4) shows that synthetic/real discrimination concentrates in the early\-to\-middle LLM layers and that temporal and prosodic perturbations disrupt it most\. We now test whether these insights yield downstream gains, alongside three practical questions: how much real data synthetic audio can replace \(§[V\-A](https://arxiv.org/html/2606.29031#S5.SS1)\), how much synthetic data helps on top of full real data \(§[V\-B](https://arxiv.org/html/2606.29031#S5.SS2)\), and whether interpretability\-guided filters add benefit \(§[V\-C](https://arxiv.org/html/2606.29031#S5.SS3)\)\. We also explore layer\-wise weighted pooling, an architectural modification that aims to extract more value from imperfect synthetic data \(§[V\-D](https://arxiv.org/html/2606.29031#S5.SS4)\)\.
### V\-A How Much Real Data Can Be Substituted by Synthetic?
A key practical question is how much annotated real speech synthetic audio can replace without degrading ASR\. We study this from two angles\. The *fixed\-budget* regime varies the real/synthetic split in 10% increments at a constant total budget \(real \+ synth ≈ 100%\), with BUT RIRs throughout; the complementary *fixed\-synth* regime keeps the full synthetic set \(100%\) and progressively adds real data, reporting both raw and RIR\-augmented synthetic audio\. Table[IV](https://arxiv.org/html/2606.29031#S5.T4) reports %WER on the real held\-out test set\.
The relationship is non\-monotonic\. Only the 90/10 \(8\.46%\) and 70/30 \(8\.45%\) real/synthetic mixtures with RIRs improve over the all\-real baseline \(8\.68%\), whereas the 80/20 split \(8\.85%\) and all mixtures exceeding 40% synthetic data degrade performance, reaching 9\.31% at equal proportions\. Even in the extreme 10%\-real / 90%\-synthetic regime, however, the resulting system \(10\.26%\) still outperforms the unadapted base model \(10\.90%\), indicating that RIR\-augmented synthetic data remains beneficial under severe data scarcity\. We conclude that a modest synthetic proportion of 10 to 30% works best: it exceeds the all\-real baseline at 10% and 30% and stays within 0\.17% absolute of it at 20%, while reducing the real\-data requirement\. Substituting more than 30% of real recordings incurs a measurable WER penalty\.
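One way to construct the fixed-budget mixtures is sketched below. The paper does not say how utterances are selected for each split, so this sketch assumes uniform random sampling, draws the synthetic part independently of the real part, and sets the budget to the size of the real corpus; the function name and signature are hypothetical.

```python
import random


def fixed_budget_mix(real_ids, synth_ids, real_fraction, seed=42):
    """Sample a real/synthetic training mix with a constant total size.

    Assumption: the total budget equals len(real_ids).
    """
    if not 0.0 <= real_fraction <= 1.0:
        raise ValueError("real_fraction must be in [0, 1]")
    rng = random.Random(seed)
    budget = len(real_ids)
    n_real = round(real_fraction * budget)
    # The remainder of the budget is filled with synthetic utterances.
    n_synth = min(budget - n_real, len(synth_ids))
    real_part = rng.sample(list(real_ids), n_real)
    # Assumption: synthetic items are drawn independently of the real ones.
    synth_part = rng.sample(list(synth_ids), n_synth)
    return real_part, synth_part
```

Calling it with `real_fraction=0.7` yields the 70/30 split; the fixed-synth regime would instead keep every synthetic utterance and vary only `n_real`.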
### V\-B How Much Synthetic Data Can Augment Real Speech?
Rather than replacing real recordings, one may ask how much synthetic data can be added on top of the full real corpus\. Table[V](https://arxiv.org/html/2606.29031#S5.T5) reports WER as the synthetic proportion grows while keeping all real data fixed\.
TABLE V: Augmentation: %WER on real test set as synthetic data is added to 100% real speech, raw synthetic vs\. RIR\-augmented synthetic\.
All RIR\-augmented conditions outperform the real\-only baseline\. Notably, even 10% synthetic with RIRs \(8\.43%\) outperforms any quantity of raw synthetic data\. Raw synthetic data shows diminishing returns beyond 50%, while RIR\-augmented data consistently improves with scale up to 100% synthetic\. This indicates that acoustic environment diversity, not sheer data volume, is the driving factor\.
### V\-C Do Interpretability\-Guided Filters Improve ASR?
The layer\-wise analysis flagged Time Stretch, Pitch Shift, and Band Pass as effective at reducing synthetic/real separation, while High Pass alone was not\. Table[VI](https://arxiv.org/html/2606.29031#S5.T6)tests whether this translates to ASR gains on a 50%\-real / 50%\-synthetic split, varying the filter applied to synthetic audio\. We add Low Pass and its combinations, since Low Pass achieved the highest layer 28 overlap in the interpretability analysis \(Table[III](https://arxiv.org/html/2606.29031#S4.T3)\)\.
Among individual filters, Pitch Shift gives the largest gain \(−0\.23% WER without RIRs, −0\.40% with\), matching the interpretability finding that pitch is a key early\-layer cue\. Time Stretch without RIRs slightly hurts \(\+0\.15%\), suggesting temporal distortion introduces mismatches that outweigh its layer\-level masking benefit\. Crucially, layer 28 overlap does not predict ASR benefit: all conditions converge there regardless of their effect at earlier layers, so the interpretability signal lives in the early\-to\-middle layers, not the final one\. Among combinations, Low Pass \+ Time Stretch gives the best no\-RIR result \(8\.68%\), matching the all\-real baseline without any acoustic augmentation, but benefits little from RIRs; low\-pass filtering and reverberation appear to target overlapping cues\. High Pass \+ Pitch Shift is instead the best condition with RIRs \(8\.58%\): retaining high\-frequency content is more compatible with reverberation than suppressing it, indicating that some filters act on cues not captured by the silhouette analysis\.
TABLE VI:%WER on real test set\. Effect of signal\-level perturbations on ASR using 50% real / 50% synthetic training split\.
### V\-D Does Layer\-wise Weighted Pooling Help in Low\-Resource Settings?
TABLE VII: %WER on real test set\. Layer\-wise weighted pooling \(LWP\) in the substitution setting \(100% synthetic\)\. “−res\.” drops the speech\-token acoustic residual\.
Mechanism\. The standard SLAM\-ASR decoder feeds only the final LLM transformer layer’s hidden states to the language modeling head\. We replace this with a layer\-wise weighted pooling \(LWP\) module that learns, per token and per utterance, a softmax\-weighted combination of all LLM layers:
$$z(t)=\sum_{l} w_{l}(t)\,h_{l}(t),\quad w_{l}(t)=\operatorname{softmax}_{l}\bigl(\mathbf{s}^{\top}h_{l}(t)\bigr) \tag{1}$$
where $\mathbf{s}\in\mathbb{R}^{D}$ is a zero\-initialised score vector \(the only new trainable parameter\)\. Zero initialisation is a neutral prior: the initial softmax is a uniform $1/L$ mixture over layers, i\.e\. a simple average that does not bias toward any depth\. An additional speech residual re\-injects the projected encoder output at speech\-token positions after pooling, preserving a direct acoustic information stream to the LM head\.
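The forward computation of Eq. (1) can be sketched in NumPy as follows. This is an illustration of the pooling arithmetic only: it is not the authors' implementation, it omits the optional speech residual and the training of the score vector, and the function name and array layout `(L, T, D)` are assumptions.

```python
import numpy as np


def layerwise_weighted_pooling(hidden_states: np.ndarray,
                               score: np.ndarray):
    """Per-token softmax pooling over layers.

    hidden_states: (L, T, D) hidden state h_l(t) for each layer and token.
    score:         (D,) score vector s.
    Returns pooled states z of shape (T, D) and weights w of shape (L, T).
    """
    # Logit for layer l and token t is the dot product s^T h_l(t).
    logits = hidden_states @ score
    # Subtract the per-token maximum for a numerically stable softmax.
    logits = logits - logits.max(axis=0, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=0, keepdims=True)  # softmax over layers
    # z(t) = sum_l w_l(t) h_l(t)
    pooled = np.einsum("lt,ltd->td", weights, hidden_states)
    return pooled, weights
```

With `score` set to all zeros, every weight equals $1/L$ and the pooled state is the plain average of the layers, which matches the neutral prior described above.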
Substitution results\. Table[VII](https://arxiv.org/html/2606.29031#S5.T7) compares %WER for the substitution scenario with and without LWP\. Without RIRs, LWP helps only at low real\-data fractions: $\mathbf{s}$ learns useful layer preferences from as little as 13\.6 h \(25%, −0\.48% WER\) but lacks signal at 10% \(5\.4 h\) and adds nothing once real data is abundant\. With RIRs, LWP is far stronger: at 25% real, LWP\+RIRs reaches 8\.70%, nearly matching the 100% real baseline \(8\.68%\), and every setting with ≥25% real audio beats it \(best 8\.23% at 100% \+ 100%\)\. We attribute this to RIRs narrowing the acoustic gap, letting $\mathbf{s}$ learn stable preferences that transfer to the real test domain\.
Residual ablation\. Ablating the speech\-token residual has no meaningful impact at any real\-data fraction, with differences of ≤0\.01% absolute, as shown in Table[VII](https://arxiv.org/html/2606.29031#S5.T7)\. The pooling mechanism itself captures the relevant acoustic information, and the residual stream is redundant since WavLM features already enter via the projector before the LLM layers and remain accessible from the initial layers of the LLM\.
Augmentation results\. Table[VIII](https://arxiv.org/html/2606.29031#S5.T8) shows that LWP without RIRs mostly hurts\. With 54 h of real data, the decoder already has well\-calibrated last\-layer representations, so the pooling weights only add noise\. Adding RIRs recovers small gains at 10 to 50% synthetic \(e\.g\. 8\.73% → 8\.37% at 10%\) but degrades slightly at ≥75%, where the processed synthetic utterances dominate and disrupt score learning\. LWP therefore helps most in low\-resource, substitution\-heavy settings, not in augmentation, where abundant real data already anchors the decoder\.
TABLE VIII: %WER on real test set\. Layer\-wise weighted pooling in the augmentation setting \(100% real \+ X% synthetic\)\.
Layer weight analysis\. Table[IX](https://arxiv.org/html/2606.29031#S5.T9) reports the mean softmax weights learned for the 25%\-real / 100%\-synthetic condition, by token type, on both real and synthetic test sets\. The weights concentrate strikingly on layer 28: speech tokens receive mean weight 0\.977 \(synthetic\) and 0\.930 \(real\), far above the uniform $1/28\approx 0.036$, with entropy as low as 0\.161 nats \(vs\. 3\.332 uniform\), confirming near\-deterministic selection\. Text tokens are less peaky \(0\.652/0\.643; 1\.303 nats\), spreading the residual weight near\-uniformly over layers 1 to 27\.
This is consistent with the interpretability findings but answers a different question\. Section[IV](https://arxiv.org/html/2606.29031#S4) showed that early\-to\-middle layers \(0 to 14\) are most discriminative between real and synthetic speech\. The LWP weights show that the final layer is most useful for decoding: by layer 28 the model has refined the acoustic input into a linguistically rich form suited to transcription, regardless of origin\. The two are complementary: early layers encode the domain gap, while the final layer encodes what matters for output\. The near\-identical profiles on both test sets indicate that LWP learns domain\-agnostic preferences, explaining why it generalises once RIRs reduce the acoustic gap\.
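The uniform reference entropy quoted in the layer weight analysis can be checked directly: a uniform distribution over 28 layers has entropy $\ln 28 \approx 3.332$ nats. The short sketch below computes this and, for contrast, the entropy of a hypothetical peaked distribution that places 0.977 on one layer and spreads the rest evenly. That peaked distribution is an illustration only; the paper reports entropies of the learned per-token weights, which need not equal the entropy of the mean weights.

```python
import math


def entropy_nats(weights):
    """Shannon entropy in nats of a discrete distribution."""
    return -sum(w * math.log(w) for w in weights if w > 0.0)


# Uniform mixture over 28 layers: entropy equals ln(28).
uniform = [1.0 / 28] * 28
# Hypothetical peaked profile: 0.977 on one layer, remainder spread evenly.
peaked = [0.023 / 27] * 27 + [0.977]

print(round(entropy_nats(uniform), 3))  # 3.332
print(entropy_nats(peaked) < entropy_nats(uniform))  # True
```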
TABLE IX: Learned LWP attention on layer 28 by token type, for the 25% real / 100% synthetic model \(uniform baseline $1/28\approx 0.036$, test set, lower entropy = more concentrated\)\.
## VI Conclusion
The interpretability and ASR analyses prove complementary rather than redundant: synthetic/real discrimination concentrates in early\-to\-middle layers \(0 to 14\), yet LWP selects the final layer for decoding\. Early layers encode the domain gap, while the final layer encodes the semantic content that matters for the task\. RIR convolution dominates throughout, not by improving audio quality \(UTMOS and PESQ show it makes synthetic audio less pristine\) but by reproducing the acoustic irregularities of real telephone recordings, and LWP transfers to real audio only once RIRs supply this stable acoustic distribution\.
Concretely, combining interpretability\-guided filters \(Pitch Shift, High Pass \+ Pitch Shift\) with RIR augmentation reaches 8\.01% WER on the test set, beating the all\-real baseline by 7\.72% relative \(0\.67% absolute\), and adding layer\-wise weighted pooling lets a substitution\-oriented system match that baseline with only 25% real data \(13\.6 h\) and surpass it at every higher fraction\.
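As a check on the figures above, the absolute and relative improvements follow from the two WER values stated in the text (8.68% for the all-real baseline and 8.01% for the best system):

$$\Delta_{\text{abs}} = 8.68 - 8.01 = 0.67, \qquad \Delta_{\text{rel}} = \frac{0.67}{8.68} \approx 0.0772 = 7.72\%$$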
Since filter effectiveness is likely tied to the encoder, future work will scale these findings to larger synthetic mixtures, additional TTS systems, alternative encoders, and multilingual settings\. A further direction is to move beyond supervised fine\-tuning and leverage synthetic data through reinforcement learning, optimising the ASR policy directly against WER\. Recent critic\-free methods such as Group Relative Policy Optimization \(GRPO\)\[[43](https://arxiv.org/html/2606.29031#bib.bib43)\] could exploit the large pool of synthetic and RIR\-augmented utterances to extract further signal from synthetic data\. All the code is available on GitHub \(repository withheld for blind review\)\.
## References
- \[1\]European Parliament and Council of the European Union, “Regulation \(eu\) 2024/1689 of the european parliament and of the council of 13 june 2024 laying down harmonised rules on artificial intelligence \(artificial intelligence act\),” Official Journal of the European Union, OJ L, 2024/1689, 12\.7\.2024, 2024,[http://data\.europa\.eu/eli/reg/2024/1689/oj](http://data.europa.eu/eli/reg/2024/1689/oj)\.
- \[2\]X\. Guo, Y\. Xie, H\. Cheng, J\. Zhou, J\. Liu, H\. Huang, L\. Ye, and Q\. Zhang, “Towards explicit acoustic evidence perception in audio llms for speech deepfake detection,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2601\.23066](https://arxiv.org/abs/2601.23066)
- \[3\]C\. Tang, W\. Yu, G\. Sun, X\. Chen, T\. Tan, W\. Li, L\. Lu, Z\. Ma, and C\. Zhang, “SALMONN: Towards generic hearing abilities for large language models,” in*The Twelfth International Conference on Learning Representations*, 2024\. \[Online\]\. Available:[https://openreview\.net/forum?id=14rn7HpKVk](https://openreview.net/forum?id=14rn7HpKVk)
- \[4\]Y\. Chu, J\. Xu, X\. Zhou, Q\. Yang, S\. Zhang, Z\. Yan, C\. Zhou, and J\. Zhou, “Qwen\-audio: Advancing universal audio understanding via unified large\-scale audio\-language models,” 2023\. \[Online\]\. Available:[https://arxiv\.org/abs/2311\.07919](https://arxiv.org/abs/2311.07919)
- \[5\]P\. K\. Rubenstein, C\. Asawaroengchai, D\. D\. Nguyen, A\. Bapna, Z\. Borsos, F\. de Chaumont Quitry, P\. Chen, D\. E\. Badawy, W\. Han, E\. Kharitonov, H\. Muckenhirn, D\. Padfield, J\. Qin, D\. Rozenberg, T\. Sainath, J\. Schalkwyk, M\. Sharifi, M\. T\. Ramanovich, M\. Tagliasacchi, A\. Tudor, M\. Velimirović, D\. Vincent, J\. Yu, Y\. Wang, V\. Zayats, N\. Zeghidour, Y\. Zhang, Z\. Zhang, L\. Zilka, and C\. Frank, “Audiopalm: A large language model that can speak and listen,” 2023\. \[Online\]\. Available:[https://arxiv\.org/abs/2306\.12925](https://arxiv.org/abs/2306.12925)
- \[6\]Z\. Ma, G\. Yang, Y\. Yang, Z\. Gao, J\. Wang, Z\. Du, F\. Yu, Q\. Chen, S\. Zheng, S\. Zhang, and X\. Chen, “Speech recognition meets large language model: benchmarking, models, and exploration,” in*Proceedings of the Thirty\-Ninth AAAI Conference on Artificial Intelligence and Thirty\-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Artificial Intelligence*, ser\. AAAI’25/IAAI’25/EAAI’25\. AAAI Press, 2025\. \[Online\]\. Available:[https://doi\.org/10\.1609/aaai\.v39i23\.34666](https://doi.org/10.1609/aaai.v39i23.34666)
- \[7\]S\. Chen, C\. Wang, Z\. Chen, Y\. Wu, S\. Liu, Z\. Chen, J\. Li, N\. Kanda, T\. Yoshioka, X\. Xiao, J\. Wu, L\. Zhou, S\. Ren, Y\. Qian, Y\. Qian, J\. Wu, M\. Zeng, X\. Yu, and F\. Wei, “Wavlm: Large\-scale self\-supervised pre\-training for full stack speech processing,”*IEEE Journal of Selected Topics in Signal Processing*, vol\. 16, no\. 6, pp\. 1505–1518, 2022\.
- \[8\]A\. Radford, J\. W\. Kim, T\. Xu, G\. Brockman, C\. McLeavey, and I\. Sutskever, “Robust speech recognition via large\-scale weak supervision,” in*Proceedings of the 40th International Conference on Machine Learning*, ser\. ICML’23\. JMLR\.org, 2023\.
- \[9\]K\.\-H\. Lu, S\.\-W\. Fu, C\.\-H\. H\. Yang, Z\. Chen, S\.\-F\. Huang, C\.\-K\. Yang, Y\.\-C\. Lin, C\.\-Y\. Hsiao, W\. Ren, E\.\-P\. Hu, Y\.\-H\. Huang, A\.\-Y\. Cheng, C\.\-H\. Chiang, Y\. Tsao, Y\.\-C\. F\. Wang, and H\. yi Lee, “How auditory knowledge in llm backbones shapes audio language models: A holistic evaluation,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2603\.19195](https://arxiv.org/abs/2603.19195)
- \[10\]N\. Rossenbach, A\. Zeyer, R\. Schlüter, and H\. Ney, “Generating synthetic audio data for attention\-based speech recognition systems,” in*ICASSP 2020 \- 2020 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*, 2020, pp\. 7069–7073\.
- \[11\]A\. Rosenberg, Y\. Zhang, B\. Ramabhadran, Y\. Jia, P\. Moreno, Y\. Wu, and Z\. Wu, “Speech recognition with augmented synthesized speech,” in*2019 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)*, 2019, pp\. 996–1002\.
- \[12\]C\. Wang, A\. Wu, J\. Pino, A\. Baevski, M\. Auli, and A\. Conneau, “Large\-Scale Self\- and Semi\-Supervised Learning for Speech Translation,” in*Interspeech 2021*, 2021, pp\. 2242–2246\.
- \[13\]G\. Yang, F\. Yu, Z\. Ma, Z\. Du, Z\. Gao, S\. Zhang, and X\. Chen, “Enhancing low\-resource asr through versatile tts: Bridging the data gap,” 2024\. \[Online\]\. Available:[https://arxiv\.org/abs/2410\.16726](https://arxiv.org/abs/2410.16726)
- \[14\]Y\. Perrin and G\. Boulianne, “Towards improved speech recognition through optimized synthetic data generation,” 2025\. \[Online\]\. Available:[https://arxiv\.org/abs/2508\.21631](https://arxiv.org/abs/2508.21631)
- \[15\]P\. Srinivasa Varadhan, S\. Thomas, S\. Teja M S, S\. Bhooshan, and M\. M\. Khapra, “The State Of TTS: A Case Study with Human Fooling Rates,” in*Interspeech 2025*, 2025, pp\. 2285–2289\.
- \[16\]H\. Su, H\. Farn, F\.\-Y\. Sun, S\.\-T\. Chen, and H\.\-y\. Lee, “Task arithmetic can mitigate synthetic\-to\-real gap in automatic speech recognition,” in*Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing*, Y\. Al\-Onaizan, M\. Bansal, and Y\.\-N\. Chen, Eds\. Miami, Florida, USA: Association for Computational Linguistics, Nov\. 2024, pp\. 8905–8915\. \[Online\]\. Available:[https://aclanthology\.org/2024\.emnlp\-main\.503/](https://aclanthology.org/2024.emnlp-main.503/)
- \[17\]C\.\-K\. Chou, C\.\-J\. Hsu, H\.\-L\. Chung, L\.\-H\. Tseng, H\.\-C\. Cheng, Y\.\-K\. Fu, K\. P\. Huang, and H\.\-Y\. Lee, “A self\-refining framework for enhancing asr using tts\-synthesized data,” 2025\. \[Online\]\. Available:[https://arxiv\.org/abs/2506\.11130](https://arxiv.org/abs/2506.11130)
- \[18\]X\. Tan, J\. Chen, H\. Liu, J\. Cong, C\. Zhang, Y\. Liu, X\. Wang, Y\. Leng, Y\. Yi, L\. He, S\. Zhao, T\. Qin, F\. Soong, and T\.\-Y\. Liu, “Naturalspeech: End\-to\-end text\-to\-speech synthesis with human\-level quality,”*IEEE Transactions on Pattern Analysis and Machine Intelligence*, vol\. 46, no\. 6, pp\. 4234–4245, 2024\.
- \[19\]J\. Mishra, M\. Chhibber, H\. jin Shim, and T\. H\. Kinnunen, “Towards explainable spoofed speech attribution and detection: a probabilistic approach for characterizing speech synthesizer components,” 2025\. \[Online\]\. Available:[https://arxiv\.org/abs/2502\.04049](https://arxiv.org/abs/2502.04049)
- \[20\]X\. Wang, J\. Yamagishi, M\. Todisco, H\. Delgado, A\. Nautsch, N\. Evans, M\. Sahidullah, V\. Vestman, T\. Kinnunen, K\. A\. Lee, L\. Juvela, P\. Alku, Y\.\-H\. Peng, H\.\-T\. Hwang, Y\. Tsao, H\.\-M\. Wang, S\. L\. Maguer, M\. Becker, F\. Henderson, R\. Clark, Y\. Zhang, Q\. Wang, Y\. Jia, K\. Onuma, K\. Mushika, T\. Kaneda, Y\. Jiang, L\.\-J\. Liu, Y\.\-C\. Wu, W\.\-C\. Huang, T\. Toda, K\. Tanaka, H\. Kameoka, I\. Steiner, D\. Matrouf, J\.\-F\. Bonastre, A\. Govender, S\. Ronanki, J\.\-X\. Zhang, and Z\.\-H\. Ling, “Asvspoof 2019: A large\-scale public database of synthesized, converted and replayed speech,”*Computer Speech & Language*, vol\. 64, p\. 101114, 2020\. \[Online\]\. Available:[https://www\.sciencedirect\.com/science/article/pii/S0885230820300474](https://www.sciencedirect.com/science/article/pii/S0885230820300474)
- \[21\]A\. Dvirniak, E\. Kushnir, D\. Tarasov, A\. Iudin, O\. Kiriukhin, M\. Pautov, D\. Korzh, and O\. Y\. Rogov, “Towards robust speech deepfake detection via human\-inspired reasoning,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2603\.10725](https://arxiv.org/abs/2603.10725)
- \[22\]S\. Baroudi, T\. Pellegrini, and H\. Bredin, “Specializing Self\-Supervised Speech Representations for Speaker Segmentation,” in*Interspeech 2024*, 2024, pp\. 3769–3773\.
- \[23\]S\. Zaiem, Y\. Kemiche, T\. Parcollet, S\. Essid, and M\. Ravanelli, “Speech Self\-Supervised Representation Benchmarking: Are We Doing it Right?” in*Interspeech 2023*, 2023, pp\. 2873–2877\.
- \[24\]S\. Baroudi, H\. Bredin, J\. Razik, and R\. Marxer, “On the use of self\-supervised representation learning for speaker diarization and separation,” in*2025 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)*, 2025, pp\. 1–7\.
- \[25\]A\. Pasad, J\.\-C\. Chou, and K\. Livescu, “Layer\-wise analysis of a self\-supervised speech representation model,” in*2021 IEEE Automatic Speech Recognition and Understanding Workshop \(ASRU\)*, 2021, pp\. 914–921\.
- \[26\]S\. wen Yang, P\.\-H\. Chi, Y\.\-S\. Chuang, C\.\-I\. J\. Lai, K\. Lakhotia, Y\. Y\. Lin, A\. T\. Liu, J\. Shi, X\. Chang, G\.\-T\. Lin, T\.\-H\. Huang, W\.\-C\. Tseng, K\. tik Lee, D\.\-R\. Liu, Z\. Huang, S\. Dong, S\.\-W\. Li, S\. Watanabe, A\. Mohamed, and H\. yi Lee, “SUPERB: Speech Processing Universal PERformance Benchmark,” in*Interspeech 2021*, 2021, pp\. 1194–1198\.
- \[27\]M\.\-H\. Hsu, X\. Zhang, X\. Tian, J\. Zhang, and Z\. Wu, “Anatomy of the modality gap: Dissecting the internal states of end\-to\-end speech llms,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2603\.01502](https://arxiv.org/abs/2603.01502)
- \[28\]A\. Grattafiori*et al\.*, “The llama 3 herd of models,” 2024\. \[Online\]\. Available:[https://arxiv\.org/abs/2407\.21783](https://arxiv.org/abs/2407.21783)
- \[29\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen, “LoRA: Low\-rank adaptation of large language models,” in*International Conference on Learning Representations*, 2022\. \[Online\]\. Available:[https://openreview\.net/forum?id=nZeVKeeFYf9](https://openreview.net/forum?id=nZeVKeeFYf9)
- \[30\]S\. Kumar, E\. Villatoro\-Tello, S\. Burdisso, K\. Hacioglu, T\. Bañeras\-Roux, H\. Watawana, D\. Sanchez\-Cortes, S\. Madikeri, P\. Motlicek, and A\. Stolcke, “Distilling conversations: Abstract compression of conversational audio context for llm\-based asr,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2603\.26246](https://arxiv.org/abs/2603.26246)
- \[31\]A\. Carofilis, S\. Burdisso, E\. Villatoro\-Tello, S\. Kumar, K\. Hacioglu, S\. Madikeri, P\. Rangappa, M\. K\. E, P\. Motlicek, S\. Venkatesan, and A\. Stolcke, “Text\-only adaptation in llm\-based asr through text denoising,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2601\.20900](https://arxiv.org/abs/2601.20900)
- \[32\]H\. Hu, X\. Zhu, T\. He, D\. Guo, B\. Zhang, X\. Wang, Z\. Guo, Z\. Jiang, H\. Hao, Z\. Guo, X\. Zhang, P\. Zhang, B\. Yang, J\. Xu, J\. Zhou, and J\. Lin, “Qwen3\-tts technical report,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2601\.15621](https://arxiv.org/abs/2601.15621)
- \[33\]Z\. Du, Q\. Chen, S\. Zhang, K\. Hu, H\. Lu, Y\. Yang, H\. Hu, S\. Zheng, Y\. Gu, Z\. Ma, Z\. Gao, and Z\. Yan, “Cosyvoice: A scalable multilingual zero\-shot text\-to\-speech synthesizer based on supervised semantic tokens,” 2024\. \[Online\]\. Available:[https://arxiv\.org/abs/2407\.05407](https://arxiv.org/abs/2407.05407)
- \[34\]E\. Casanova, K\. Davis, E\. Gölge, G\. Göknar, I\. Gulea, L\. Hart, A\. Aljafari, J\. Meyer, R\. Morais, S\. Olayemi, and J\. Weber, “XTTS: a Massively Multilingual Zero\-Shot Text\-to\-Speech Model,” in*Interspeech 2024*, 2024, pp\. 4978–4982\.
- \[35\]Y\. Lacombe, V\. Srivastav, and S\. Gandhi, “Parler\-tts,”[https://github\.com/huggingface/parler\-tts](https://github.com/huggingface/parler-tts), 2024\.
- \[36\]S\. Zhou, Y\. Zhou, Y\. He, X\. Zhou, J\. Wang, W\. Deng, and J\. Shu, “Indextts2: A breakthrough in emotionally expressive and duration\-controlled auto\-regressive zero\-shot text\-to\-speech,” 2025\. \[Online\]\. Available:[https://arxiv\.org/abs/2506\.21619](https://arxiv.org/abs/2506.21619)
- \[37\]H\. Zhu, L\. Ye, W\. Kang, Z\. Yao, L\. Guo, F\. Kuang, Z\. Han, W\. Zhuang, L\. Lin, and D\. Povey, “Omnivoice: Towards omnilingual zero\-shot text\-to\-speech with diffusion language models,” 2026\. \[Online\]\. Available:[https://arxiv\.org/abs/2604\.00688](https://arxiv.org/abs/2604.00688)
- \[38\]Resemble AI, “Chatterbox\-TTS,”[https://github\.com/resemble\-ai/chatterbox](https://github.com/resemble-ai/chatterbox), 2025, GitHub repository\.
- \[39\]T\. Ko, V\. Peddinti, D\. Povey, M\. L\. Seltzer, and S\. Khudanpur, “A study on data augmentation of reverberant speech for robust speech recognition,” in*2017 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\)*, 2017, pp\. 5220–5224\.
- \[40\]I\. Szöke, M\. Skácel, L\. Mošner, J\. Paliesek, and J\. Černocký, “Building and evaluation of a real room impulse response dataset,”*IEEE Journal of Selected Topics in Signal Processing*, vol\. 13, no\. 4, pp\. 863–876, 2019\.
- \[41\]H\. Bredin, “pyannote\.audio 2\.1 speaker diarization pipeline: principle, benchmark, and recipe,” in*Interspeech 2023*, 2023, pp\. 1983–1987\.
- \[42\]T\. Saeki, D\. Xin, W\. Nakata, T\. Koriyama, S\. Takamichi, and H\. Saruwatari, “UTMOS: UTokyo\-SaruLab System for VoiceMOS Challenge 2022,” in*Interspeech 2022*, 2022, pp\. 4521–4525\.
- \[43\]Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo, “Deepseekmath: Pushing the limits of mathematical reasoning in open language models,” 2024\. \[Online\]\. Available:[https://arxiv\.org/abs/2402\.03300](https://arxiv.org/abs/2402.03300)