HybridCodec: Modeling Discrete and Continuous Representations for Efficient Speech Language Models
Summary
Proposes HybridCodec, a novel framework combining temporally compressed discrete tokens with continuous residuals to improve speaker characteristic retention in speech language models, reducing autoregressive steps while maintaining quality.
# Modeling Discrete and Continuous Representations For Efficient Speech Language Models

Source: [https://arxiv.org/html/2606.27627](https://arxiv.org/html/2606.27627)

Artem Ploujnikov, Francesco Verdini, Samir Sadok, Mirco Ravanelli

1 Mila, Quebec AI Institute, Canada; 2 Concordia University, Canada; 3 Sapienza University of Rome, Italy; 4 Inria, Université Grenoble Alpes CNRS, LJK, France

artem.ploujnikov@mail.concordia.ca, francesco.verdini@uniroma1.it, samir.sadok@inria.fr, mirco.ravanelli@mail.concordia.ca

###### Abstract

Discrete audio representations have become increasingly popular for building multimodal text-audio systems and integrating audio capabilities into Large Language Models (LLMs). However, numerous studies report performance degradation on various downstream tasks due to information loss during discretization. To address this, we propose a novel approach combining temporally compressed discrete tokens with dimensionality-reduced continuous residuals. Our framework consists of a hybridized discrete-continuous focal modulation codec and a hybrid Transformer. This architecture performs autoregressive inference in the discrete domain, coupled with non-autoregressive prediction and continuous residual upsampling. Experimental results show that our approach significantly improves the retention of speaker characteristics compared to discrete-only methods, while simultaneously reducing the number of required autoregressive steps.

###### keywords:

speech recognition, speech synthesis, text-to-speech, audio representations, neural audio codecs.

## 1 Introduction

The human mind processes the world through a complex interplay of discrete categories and continuous spectra [discrete-continuous-brain, attractor-integrator]. Human language perfectly illustrates this duality.
It imposes a clear *discrete* hierarchy (sequences of phonemes forming words and sentences captured in alphabets or logographic systems) onto a rich modulation of *continuous* characteristics, such as pitch, tone, emotion, and prosody.

The advent of the Transformer architecture [transformer] established discrete token sequences as the *de facto* medium for modern artificial intelligence. This paradigm, which drives autoregressive generation and Large Language Models (LLMs) [gpt, llama, gemini], was subsequently adapted for the audio domain. Pioneering architectures like the Vector Quantized Variational Autoencoder (VQ-VAE) [vqvae] demonstrated that continuous information can be effectively compressed into a discrete latent space, motivating the development of neural audio codecs (NACs) [dates, kyutai2024moshi, xin2024bigcodec, dac]. Fundamentally, a NAC comprises an encoder, a vector quantizer, and a decoder that map continuous audio into low-bitrate discrete tokens and back to the waveform. Unlike traditional codecs such as MP3 that rely on algorithmic signal processing and psychoacoustics, NACs learn a finite or variable data-driven *vocabulary* of sounds. This allows them to achieve extreme compression rates while preserving rich semantic and acoustic features, effectively bridging the gap between raw signal processing and natural language modeling by enabling LLMs to process speech as natively as text. Models like AudioLM (semantic and acoustic modeling) [borsos2023audiolm], VALL-E (zero-shot voice cloning) [valle], and SpeechGPT (cross-modal speech-text LLM) [speechgpt] have successfully leveraged these discrete audio tokens to drive significant breakthroughs in zero-shot speech synthesis and end-to-end multimodal dialogue.

Despite their advantages, fully discrete representations introduce an inherent quantization penalty.
Benchmarks (e.g., SUPERB [superb], DASB [dasb-benchmark]) and recent comparative surveys [dates, speechdt, kammoun2025modeling] highlight a fundamental trade-off: while discrete tokens facilitate stable convergence and seamless LLM integration, the quantization process irreversibly discards fine-grained acoustic details. Fundamentally, this loss stems from the classic rate-distortion trade-off [cover1999elements, shannon1959coding]. At low bit-rates, NACs prioritize semantic content intelligibility over acoustic richness, lacking the bandwidth to encode micro-prosody and speaker timbre [dates].

To mitigate this limitation, we propose a novel *hybrid* paradigm in which the codec supports optional refinements through high-frame-rate continuous residuals, and an LM can start with a lossy, low-resolution approximation and then compute a one-step continuous refinement, vastly reducing the total number of forward passes required at inference.

Our main contributions are as follows: (1) HybridCodec, a novel NAC framework extending FocalCodec [focalcodec, focalcodec-streaming], which jointly extracts time-reduced discrete tokens and models the remaining information as dimensionality-reduced continuous residuals; (2) HybridLM, a decoder-only Transformer [transformer] designed to process these hybrid representations. It unifies efficient, low-frame-rate autoregressive (AR) prediction for discrete tokens with a single-step non-autoregressive (NAR) prediction and continuous residual upsampling; (3) a unified framework that leverages the HybridLM architecture to handle major downstream speech tasks, including ASR and TTS, within a single model.
Figure 1: Overview of the proposed architecture: HybridCodec (left) provides dual-path discrete-continuous compression, and HybridLM (right) unifies these representations through interleaved autoregressive and non-autoregressive decoding. This hybrid paradigm restores the fine-grained information lost in discrete LMs.

Experimental results on the LibriTTS [libritts] dataset show that our approach significantly outperforms discrete baselines, especially at extremely low frame rates such as 6.25 Hz, while substantially reducing AR steps.

## 2 Related Work

Recent work has responded to the discrete-continuous performance gap with further task-specific analysis and a variety of adaptations. Studies in ASR [discrete-continuous-asr] confirm the gap, showing that such an information bottleneck directly degrades downstream performance by stripping the signal of its prosodic nuance and speaker identity. To overcome this limitation, recent literature has explored re-integrating continuous features through diffusion mechanisms, continuous autoregressive modeling, or masked modeling [clear-tts, spear-tts, sadok2026residual]. However, these approaches remain highly task-specific, sacrificing the unified, generalizable framework that discrete LLMs provide. This limits their ability to handle diverse speech applications (like generation and recognition) within a single model. Discrete-continuous hybridization has been successfully explored in other domains, such as RL and robotics [hyar, discrete-continuous-em, discrete-continuous-robot], text diffusion [discrete-continuous-diffusion], and others.

This raises a critical question: *is it possible to design a unified language model that leverages the efficiency of discrete tokens while restoring the rich acoustic nuances of continuous speech?* To the best of our knowledge, our approach is the first to unify discrete and continuous refinement within a single Transformer architecture.
By leveraging these two domains, we achieve high-fidelity speech synthesis at ultra-low frame rates.

## 3 Model Architecture

### 3.1 Preliminaries: The FocalCodec Architecture

FocalCodec [focalcodec] employs an asymmetric VQ-VAE architecture centered around a compressor-quantizer-decompressor bottleneck. It uses the first six layers of a pretrained WavLM as a base encoder to jointly extract acoustic and semantic features. Its core pipeline relies on *focal modulation*: a *Focal Encoder (FE)* (compressor) downsamples these continuous features (denoted $\mathbf{x}_{\text{base}}$) into a compact latent space in linear time by aggregating multi-scale global and local contexts. The representations are then discretized using Binary Spherical Quantization (BSQ) [zhao2024image], a lookup-free approach that enforces bounded quantization errors and maximizes codebook utilization. Then, a *Focal Decoder (FD)* (decompressor) mirrors the downscaling process to upsample the discrete tokens and explicitly reconstruct the original continuous WavLM representations. Finally, a lightweight Vocos decoder [vocos] synthesizes the audio waveform directly from these restored continuous features.

### 3.2 HybridCodec: Extracting Hybrid Representations

The HybridCodec, shown in Figure [1](https://arxiv.org/html/2606.27627#S1.F1) (left), extends FocalCodec [focalcodec] by adding a secondary pathway. This branch, consisting of an additional focal encoder and decoder, captures and compresses the continuous residual information lost during discretization.

Encoding: Dual-Path Feature Extraction. The encoding process maps the base representations, $\mathbf{x}_{\text{base}} \in \mathbb{R}^{T \times d}$, into a dual discrete-continuous latent space.
First, the *discrete pathway* (highlighted in red in Fig. [1](https://arxiv.org/html/2606.27627#S1.F1)) extracts the quantized indices $\mathbf{z}_q = \mathrm{FQ}_{\theta}(\mathbf{x}_{\text{base}})$. From these indices, we derive the quantized approximation $\hat{\mathbf{x}}_{\text{quant}} = \mathrm{BSQ}_{\theta}^{-1}(\mathbf{z}_q)$. Second, the *continuous pathway* (highlighted in green in Fig. [1](https://arxiv.org/html/2606.27627#S1.F1)) captures the fine-grained acoustic details lost to quantization by computing the residual error: $\mathbf{x}_{\text{res}} = \mathbf{x}_{\text{base}} - \hat{\mathbf{x}}_{\text{quant}}$. This continuous residual is compressed by a dedicated residual focal encoder, $\mathrm{FE}_{\text{res}}$, which applies a temporal down-sampling stride $r$ to yield a dimensionality-reduced bottleneck representation: $\bar{\mathbf{x}}_{\text{res}} = \mathrm{FE}_{\text{res}}(\mathbf{x}_{\text{res}})$. To control the temporal resolution, we adjust the strides of $\mathrm{FE}_{\text{res}}$: $(1,1,1)$ for $50~\text{Hz}$, $(2,1,1)$ for $25~\text{Hz}$, $(2,2,1)$ for $12.5~\text{Hz}$, and $(2,2,2)$ for $6.25~\text{Hz}$.

Decoding: Feature Fusion and Reconstruction. The decoding process perfectly mirrors the encoding stages to reconstruct the full hybrid signal. First, the *discrete pathway* projects the indices $\mathbf{z}_q$ back into the continuous embedding space via the inverse quantizer: $\hat{\mathbf{x}}_{\text{quant}} = \mathrm{FQ}_{\theta}^{-1}(\mathbf{z}_q)$. Second, the *continuous pathway* passes the bottleneck residual $\bar{\mathbf{x}}_{\text{res}}$ through a residual focal decoder, $\mathrm{FD}_{\text{res}}$.
This module upsamples the representation by the factor $r$ to restore the original temporal resolution: $\hat{\mathbf{x}}_{\text{res}} = \mathrm{FD}_{\text{res}}(\bar{\mathbf{x}}_{\text{res}}) \in \mathbb{R}^{T \times d}$. Finally, the full representation is synthesized by adding both streams together before passing them to the Vocos decoder: $\hat{\mathbf{x}}_{\text{base}} = \hat{\mathbf{x}}_{\text{quant}} + \hat{\mathbf{x}}_{\text{res}}$.

Table 1: Resynthesis performance between baseline codecs and our hybrid codec. ↑/↓ indicates higher/lower is better. Bold and second denote the best and second-best results, respectively.

| NAC | Frame rate | UTMOS (↑) | dWER (↓) | SpkSim (↑) | Code Usage (↑) | Norm Entropy (↑) |
|---|---|---|---|---|---|---|
| Reference | - | 4.09 | 0.00 | 100.0 | - | - |
| DAC [kumar2023high] | 50 Hz | 1.29 | 20.04 | 89.2 | 100.0 | 91.7 |
| Mimi [kyutai2024moshi] | 12.5 Hz | 3.29 | 5.73 | 96.0 | 95.6 | 91.8 |
| BigCodec [xin2024bigcodec] | 50 Hz | 4.11 | 2.55 | 98.5 | 100.0 | 98.6 |
| FocalCodec [focalcodec] | 12.5 Hz | 4.22 | 7.94 | 93.9 | 98.2 | 97.4 |
| FocalCodec [focalcodec] | 25 Hz | 4.14 | 3.30 | 96.3 | 99.8 | 98.4 |
| HybridCodec | 50 Hz | 4.07 | 1.47 | 97.2 | 99.9 | 96.3 |
| HybridCodec | 25 Hz | 4.07 | 1.48 | 96.7 | 98.8 | 96.8 |
| HybridCodec | 12.5 Hz | 4.09 | 1.47 | 96.2 | 97.1 | 96.7 |
| HybridCodec | 6.25 Hz | 3.98 | 1.50 | 97.1 | 97.4 | 98.2 |

### 3.3 HybridLM Architecture

HybridLM is a GPT-style [gpt] decoder-only Transformer, illustrated in Figure [1](https://arxiv.org/html/2606.27627#S1.F1) (right), tailored to process the dual representations of HybridCodec (Section [3.2](https://arxiv.org/html/2606.27627#S3.SS2)). It unifies autoregressive (AR) and non-autoregressive (NAR) decoding within a single network: discrete tokens drive the AR phase to establish semantic structure, while continuous residuals are predicted in a NAR pass to recover high-fidelity acoustic details.
Unlike VALL-E [valle], our model supports mixed discrete-continuous prompts at different temporal scales. The model was designed to fully exploit HybridCodec features, including both semantic indices (in AR mode) and continuous residuals (single-pass NAR).

Unified AR and NAR Modeling via AdaLN. Combining AR classification (token generation) and NAR regression (residual prediction) risks objective interference in deeper layers if relying on simple prefix conditioning. To mitigate this, we employ Adaptive Layer Normalization (AdaLN) to multiplex both operational modes. By injecting a mode-specific embedding ($i_{\text{mode}} \in \{\text{AR}, \text{NAR}\}$) at every layer, AdaLN provides deep conditioning that dynamically adapts internal representations. This effectively creates two specialized, interference-free sub-models within a shared backbone [adaspeech, valle]. We train models with 12 layers, 4 attention heads, $d_{\text{model}} = d_{\text{emb}} = 512$, and $d_{\text{ffn}} = 2048$ (inner dimension of the feed-forward layers).
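
As a rough illustration of the mode conditioning described above, the following pure-Python sketch applies a mode-dependent scale and shift to a layer-normalized vector. The class name `ModeAdaLN`, the toy width of 8 (the paper uses 512), the random initialization, and the choice to start the scale bias at 1 are all assumptions made for this example; the paper does not specify them.

```python
import math
import random


def layer_norm(z, eps=1e-5):
    """Standard layer normalization over a single feature vector."""
    mean = sum(z) / len(z)
    var = sum((v - mean) ** 2 for v in z) / len(z)
    return [(v - mean) / math.sqrt(var + eps) for v in z]


def affine(weight, bias, x):
    """Compute weight @ x + bias for a list-of-rows matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


class ModeAdaLN:
    """Hypothetical AdaLN layer conditioned on a decoding mode (AR or NAR)."""

    def __init__(self, d_model, modes=("AR", "NAR"), seed=0):
        rng = random.Random(seed)

        def random_matrix():
            return [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
                    for _ in range(d_model)]

        # One embedding vector per mode (assumed random here; learned in practice).
        self.emb = {m: [rng.gauss(0.0, 1.0) for _ in range(d_model)] for m in modes}
        # Scale projection starts near 1, shift projection near 0 (an assumption).
        self.w_gamma, self.b_gamma = random_matrix(), [1.0] * d_model
        self.w_beta, self.b_beta = random_matrix(), [0.0] * d_model

    def __call__(self, z, mode):
        e = self.emb[mode]                               # mode embedding
        gamma = affine(self.w_gamma, self.b_gamma, e)    # scaling vector
        beta = affine(self.w_beta, self.b_beta, e)       # bias vector
        z_bar = layer_norm(z)                            # standard LN
        return [g * zb + b for g, zb, b in zip(gamma, z_bar, beta)]


ada_ln = ModeAdaLN(d_model=8)
z = [0.5, -1.0, 2.0, 0.0, 1.5, -0.5, 3.0, -2.0]
z_ar = ada_ln(z, "AR")    # same input, AR-conditioned output
z_nar = ada_ln(z, "NAR")  # same input, NAR-conditioned output
```

The same hidden vector yields different outputs in the two modes, which is the sense in which a shared backbone can behave as two conditioned sub-models.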
Given a decoding mode identifier $i_{\text{mode}}$, the AdaLN modulation parameters are computed as follows:

$$
\begin{aligned}
\mathbf{e} &= \mathrm{Emb}(i_{\text{mode}}) && \text{(Mode embedding)} \\
\boldsymbol{\gamma} &= \mathbf{W}_{\gamma}\mathbf{e} + \mathbf{b}_{\gamma} && \text{(Scaling vector)} \\
\boldsymbol{\beta} &= \mathbf{W}_{\beta}\mathbf{e} + \mathbf{b}_{\beta} && \text{(Bias vector)} \\
\bar{\mathbf{z}} &= \mathrm{LayerNorm}(\mathbf{z}) && \text{(Standard LN)} \\
\mathbf{z}_{\text{cond}} &= \boldsymbol{\gamma} \odot \bar{\mathbf{z}} + \boldsymbol{\beta} && \text{(Affine transform)}
\end{aligned}
$$

where $\mathrm{Emb}(\cdot)$ is a learned embedding layer mapping the discrete mode identifier to a continuous vector $\mathbf{e}$, $\mathbf{W}_{\gamma}$ and $\mathbf{W}_{\beta}$ are learnable weight matrices, $\mathbf{b}_{\gamma}$ and $\mathbf{b}_{\beta}$ are bias terms, $\mathbf{z}$ denotes a latent variable (usually the output of the previous layer or of attention/FFN), and $\odot$ denotes the Hadamard product.

Speaker Embeddings. To condition the generation on a specific voice, we inject pretrained ECAPA-TDNN [ecapa-tdnn] speaker embeddings, extracted using the SpeechBrain [speechbrain_v1] toolkit. These embeddings are integrated via a simple linear projection and addition to all token embeddings in the source sequence.

Training Procedure. Both the discrete and continuous paths are trained conventionally with teacher forcing, as in the original Transformer [transformer], and the two losses (NLL for discrete and MSE for continuous) are combined.

Cascaded Inference. During the inference phase, the generation proceeds in a cascaded manner. Given a task-specific conditioning sequence $\mathbf{c}$ (e.g.
text, phonemes or an acoustic prefix), the discrete tokens are first generated autoregressively. Subsequently, the continuous residuals are predicted in a single non-autoregressive forward pass:

$$
\begin{aligned}
\hat{\mathbf{z}}_q &= \mathrm{AR}(\mathbf{c}) && \text{(AR generation)} \\
\mathbf{h}_{\text{NAR}} &= \big[\mathbf{c} \parallel \mathrm{Up}(\hat{\mathbf{z}}_q, r)\big] && \text{(Upsample \& Concat)} \\
\hat{\mathbf{z}}_{\text{res}} &= f_{\text{SLT}}^{-1}\big(\mathrm{NAR}(\mathbf{h}_{\text{NAR}})\big) && \text{(NAR prediction)} \\
\hat{\mathbf{s}} &= \mathrm{Decoder}(\hat{\mathbf{z}}_q, \hat{\mathbf{z}}_{\text{res}}) && \text{(Waveform synthesis)}
\end{aligned}
$$

where $\mathrm{AR}$ denotes the autoregressive decoding of the discrete tokens, and $\mathrm{NAR}$ represents the non-autoregressive prediction of the continuous residuals. The operator $\parallel$ denotes sequence concatenation along the temporal dimension. The $\mathrm{Up}$ function aligns the temporal resolution of the generated discrete tokens $\hat{\mathbf{z}}_q$ with the continuous space using the defined up-sampling rate $r$. Finally, the $\mathrm{Decoder}$ module synthesizes the final audio waveform $\hat{\mathbf{s}}$ by combining both the discrete and continuous representations. The signed-log transform $f_{\text{SLT}}(x) = \operatorname{sign}(x)\log(|x| + 1)$ [slt] is used to improve training dynamics.
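
The cascaded procedure above can be sketched as a small control-flow skeleton that counts Transformer forward passes. This is only an illustration under stated assumptions: `ar_step` and `nar_pass` are stub functions standing in for the real model, `upsample` realizes $\mathrm{Up}$ by simple repetition (the paper does not say how $\mathrm{Up}$ is implemented), the NAR stub is assumed to return one value per upsampled position, and waveform decoding is omitted. Only the signed-log transform and its inverse follow directly from the formula in the text.

```python
import math


def signed_log(x):
    """f_SLT(x) = sign(x) * log(|x| + 1)."""
    return math.copysign(math.log1p(abs(x)), x)


def signed_log_inverse(y):
    """Inverse of f_SLT: sign(y) * (exp(|y|) - 1)."""
    return math.copysign(math.expm1(abs(y)), y)


def upsample(tokens, r):
    """Placeholder for Up(., r): repeat each token r times (an assumption)."""
    return [t for t in tokens for _ in range(r)]


def cascaded_inference(cond, n_discrete, r, ar_step, nar_pass):
    """Run the AR phase, then one NAR pass; return tokens, residual, pass count."""
    passes = 0
    tokens = []
    for _ in range(n_discrete):                 # AR phase: one pass per token
        tokens.append(ar_step(cond, tokens))
        passes += 1
    h_nar = list(cond) + upsample(tokens, r)    # [c || Up(z_q, r)]
    residual = [signed_log_inverse(v) for v in nar_pass(h_nar)]
    passes += 1                                 # single NAR pass
    return tokens, residual, passes


# Stub "models" (hypothetical): they only produce values of the right shape.
cond = ["<bos>", "h", "i", "<eop>"]


def ar_step(cond, tokens):
    return len(tokens) % 16


def nar_pass(h_nar):
    n_frames = len(h_nar) - len(cond)
    return [signed_log(0.5)] * n_frames         # residuals in signed-log space


# 10 s of audio: 125 discrete tokens at 12.5 Hz, upsampled by r = 4.
tokens, residual, passes = cascaded_inference(
    cond, n_discrete=125, r=4, ar_step=ar_step, nar_pass=nar_pass
)
print(passes)  # 125 AR passes + 1 NAR pass
```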
With a downsampled discrete track, the generation time is significantly reduced: $n_{\text{cascade}} = n_{\text{full}}/r + 1$, where $n_{\text{cascade}}$ is the number of Transformer steps required with cascading inference, $n_{\text{full}}$ is the number of full autoregressive steps, and $r = f_{\text{base}}/f_{\text{model}}$ is the scaling factor. For instance, generating a 10-second sample at $50~\text{Hz}$ traditionally takes 500 steps, whereas a 12.5 Hz residual-enhanced model takes only $500/4 + 1 = 126$ steps, without the significant quality loss of the discrete-only model.

## 4 Experimental Setup

Table 2: Comparison of downstream task (TTS/ASR) performance (best results in bold).

We use the 960-hour LibriTTS [libritts] dataset, an extension of LibriSpeech [librispeech] specifically optimized for TTS. We train on both the *clean* and *other* (distorted) subsets, but strictly limit our evaluation to the *clean* test set to maintain consistency. To align with the original FocalCodec setup and avoid out-of-distribution artifacts, we exclude any audio samples exceeding 20 seconds. During evaluation, ASR performance is computed on the full test set. For TTS, we uniformly sample a single subset of 1,000 utterances. We implement our framework using the SpeechBrain [speechbrain_v1] toolkit. For reproducibility and to support the community, all source code and models will be made publicly available within the SpeechBrain project ([https://speechbrain.github.io/](https://speechbrain.github.io/)).

### 4.1 Metrics

We use the objective evaluation metrics listed below:

- Audio quality and naturalness: We report *UTMOS* [utmos] and *NISQA* [nisqa], neural estimators of the Mean Opinion Score (MOS, ranging from $1.0$ to $5.0$, where higher is better). While UTMOS evaluates perceived overall naturalness, NISQA specifically targets signal transmission quality.
- Intelligibility: We measure robustness against mispronunciations and acoustic artifacts using the differential Word Error Rate (*dWER*). It is computed as the word-level edit distance between ASR transcriptions of the synthesized audio and the ground truth. Following standard benchmarks like DASB [dasb-benchmark], we intentionally employ Whisper Small [whisper] with greedy decoding; a weaker ASR model is strictly preferable here, as it is less capable of implicitly compensating for underlying codec or synthesis flaws.
- Speaker identity preservation: We quantify voice fidelity via *SpkSim* (ranging from $0.0$ to $1.0$, higher is better; Table 1 reports it on a 0 to 100 scale). This metric calculates the cosine similarity between latent embeddings extracted from a pretrained WavLM model fine-tuned for speaker verification (WavLM-SV), ensuring the synthesized vocal characteristics strongly match the original target.
- Quantization efficiency: To evaluate the effectiveness of discrete codebook use, we report *Code Usage* (percentage of active codebook vectors) and *Normalized Entropy* (token uniformity). High values indicate optimal vocabulary exploitation, preventing the index collapse typical of extreme low-bitrate compression.

### 4.2 Tasks

We evaluate our approach across three tasks. For TTS and ASR, input prompts and targets are combined into a single sequence using [BOS], [EOP] (which separates the prompt from the generated target), and [EOS] control tokens.

1. Resynthesis: Evaluates the standalone reconstruction quality of the WavLM-based HybridCodec by decoding ground-truth discrete tokens and continuous residuals directly through the vocoder, without a language model.
2. Text-To-Speech (TTS): The model generates audio from a text prompt ([BOS][Chars][EOP]). The target consists of AR-generated discrete tokens followed by a NAR residual prediction pass. Quality is assessed via UTMOS, dWER, and SpkSim.
3. Automatic Speech Recognition (ASR): The model maps audio prompts (discrete tokens and residuals) to text ([Text][EOS]). To isolate the impact of hybrid representations, we use simple greedy search, focusing on relative performance (Word Error Rate, WER, and Character Error Rate, CER) rather than state-of-the-art benchmarking.

## 5 Results

We first evaluate the reconstruction capabilities of our codec through resynthesis, before assessing its performance on downstream tasks. In both scenarios, we compare our hybrid approach against discrete-only baselines.

Resynthesis: Table [1](https://arxiv.org/html/2606.27627#S3.T1) compares HybridCodec against state-of-the-art NACs [dac, kyutai2024moshi, focalcodec, xin2024bigcodec]. To our knowledge, ours is the first approach to maintain such high semantic and speaker preservation at ultra-low frame rates (6.25 Hz). While baselines like FocalCodec [focalcodec] degrade at lower frame rates, our approach remains remarkably robust and stable. At 12.5 Hz, HybridCodec achieves the best intelligibility with dWER of 1.47, a significant improvement over the 7.94 dWER of the discrete-only FocalCodec baseline. Speaker similarity also remains high (97.1) compared to other 12.5 Hz models like Mimi [kyutai2024moshi] (96.0). Even at an extreme 6.25 Hz, performance remains nearly identical (3.98 UTMOS, 1.50 dWER). This shows that our residual information effectively mitigates the quantization penalty, offering an efficient alternative to high-frequency NACs.

TTS: Table [2](https://arxiv.org/html/2606.27627#S4.T2) summarizes the results of discrete and hybrid TTS and ASR using HybridCodec within the HybridLM framework. Note that the discrete, non-finetuned version is identical to the publicly available, frame-rate-matched FocalCodec [focalcodec].
For TTS, these results empirically validate our initial hypothesis: introducing a single non-autoregressive residual prediction step at the end of inference effectively mitigates the severe performance degradation typical of low-token-rate, discrete-only codecs. For instance, in the zero-shot setting at 12.5 Hz, our hybrid method more than doubles the UTMOS score (4.10 vs. 1.99) and reduces the dWER by more than half (14.79 vs. 32.97) compared to the discrete baseline. Similarly to our resynthesis findings, the performance gap between the hybrid and discrete representations widens as the operating frequency decreases. At an extremely low rate of 6.25 Hz, the hybrid approach still achieves a UTMOS of 3.08 and a dWER of 48, vastly outperforming the discrete-only representation (1.44 UTMOS and 121 dWER). This demonstrates that as strict quantization discards acoustic details, our continuous refinement becomes important for recovering signal fidelity.

ASR: The inclusion of continuous residuals not only maintains semantic integrity but consistently improves transcription accuracy across all frame rates. While discrete tokens capture core semantics, continuous residuals provide acoustic cues that directly improve recognition. As shown in Table [2](https://arxiv.org/html/2606.27627#S4.T2), the hybrid approach lowers the WER at 50 Hz from 28.11 to 23.36, and the CER from 14.48 to 12.36. This performance gain holds even at higher compression levels: at 12.5 Hz, the hybrid model achieves a 25.94 WER (vs. 28.50 for the baseline), and at the extreme 6.25 Hz rate, it reduces the WER from 29.13 to 27.36. This is a critical success for our unified framework, confirming that the continuous acoustic information improves rather than interferes with the underlying semantic representations used by HybridLM.
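
To put the ASR figures quoted above on a common footing, the short sketch below converts each baseline/hybrid pair into a relative error reduction. The numbers are copied from the text; the dictionary layout and the helper name `relative_reduction` are illustrative choices, not part of the paper.

```python
# (metric, frame rate) -> (discrete baseline, hybrid), as quoted in the text.
reported = {
    ("WER", "50 Hz"): (28.11, 23.36),
    ("CER", "50 Hz"): (14.48, 12.36),
    ("WER", "12.5 Hz"): (28.50, 25.94),
    ("WER", "6.25 Hz"): (29.13, 27.36),
}


def relative_reduction(baseline, hybrid):
    """Percentage by which the hybrid score undercuts the baseline."""
    return 100.0 * (baseline - hybrid) / baseline


reductions = {key: relative_reduction(*pair) for key, pair in reported.items()}

for (metric, rate), value in reductions.items():
    print(f"{metric} at {rate}: {value:.1f}% relative reduction")
```

On these quoted numbers, the relative WER reduction is largest at 50 Hz and shrinks at lower frame rates, although the hybrid model is ahead in every pair.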
## 6 Conclusion

This work introduced HybridCodec, a novel framework that bridges discrete efficiency and continuous acoustic fidelity at remarkably low frame rates. By combining discrete tokens with a non-autoregressive residual pathway, we recovered high-fidelity speech details at an ultra-low temporal resolution of 6.25 Hz. Our results show that this hybrid approach outperforms a discrete-only baseline in TTS quality and intelligibility, while simultaneously reducing error rates in discriminative ASR tasks. The HybridLM architecture further shows that these dual representations can be unified within a single Transformer via AdaLN. By operating at such extreme compression levels, our method significantly reduces the number of inference steps, offering a highly efficient alternative for long-form synthesis.

## 7 Generative AI Use Disclosure

LLMs [chatgpt, copilot, opus, gemini, ai2_asta] have been used for advanced search, for boilerplate automation, and as a technical reference. LLMs have not been used to author text for the paper, except BibTeX formatting and grammar/wording revisions. LLM outputs were manually reviewed.

## 8 Acknowledgments

We gratefully acknowledge the support of NSERC, the Digital Research Alliance of Canada (alliancecan.ca), Translated (Imminent Program), and Apple (Seed Grant) through research funding, computing resources, and donations. Samir Sadok was supported by the VisaSpeech Inria Associated Team initiative.

## References