Continuous Audio Thinking for Large Audio Language Models

arXiv cs.AI Papers

Summary

The paper introduces Continuous Audio Thinking (CoAT), a framework that equips large audio language models with a continuous latent workspace to organize acoustic information before generating textual responses, improving performance on audio reasoning, understanding, and transcription tasks without additional decoding cost.

arXiv:2606.18273v1 Announce Type: cross

# Continuous Audio Thinking for Large Audio Language Models
Source: [https://arxiv.org/html/2606.18273](https://arxiv.org/html/2606.18273)
Gyojin Han, Dong-Jae Lee∗, Changho Choi∗, Jongsuk Kim∗, Junmo Kim

KAIST, South Korea

{hangj0820, jhtwosun, ccho4702, jskpop, junmo.kim}@kaist.ac.kr

###### Abstract

Large audio language models \(LALMs\) have shown impressive capabilities on diverse audio understanding tasks, ranging from speech transcription to music analysis\. However, because LALMs are typically trained to produce text\-aligned responses, their hidden states are progressively shaped for text generation rather than for preserving acoustic information\. As a result, the diverse acoustic content that audio carries, such as phonetic detail, prosody, sound events, affect, and pitch, is lost along the way and difficult to leverage in the response\. We introduce Continuous Audio Thinking \(CoAT\), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information prior to response generation, grounded by distillation from audio experts\. Within the thinking space, the model can utilize the rich acoustic information provided by expert distillation when generating its response\. Furthermore, the proposed continuous thinking block can be processed in a single prefill, so CoAT does not require additional autoregressive decoding cost over the baseline\. Across three LALMs, Qwen2\-Audio, Qwen2\.5\-Omni\-7B, and Audio Flamingo 3, performance gains on a broad benchmark suite spanning audio reasoning, audio understanding, music classification, speech emotion, and speech transcription demonstrate the effectiveness of CoAT\. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model’s textual responses\.

## 1 Introduction

Large Audio Language Models (LALMs) \[[9](https://arxiv.org/html/2606.18273#bib.bib2),[55](https://arxiv.org/html/2606.18273#bib.bib3),[19](https://arxiv.org/html/2606.18273#bib.bib7),[49](https://arxiv.org/html/2606.18273#bib.bib4)\] have established themselves as a natural interface for understanding speech, environmental sounds, music, and other acoustic signals through language. These models couple an audio encoder with a language model trained to autoregressively generate textual responses. This design enables strong progress on spoken dialogue and audio question answering, but it also introduces a fundamental supervision mismatch: the input contains rich frame-level acoustic structure, while the training objective provides signal only through sparse response tokens. The layers above the audio encoder are therefore encouraged to retain only the information that is immediately useful for predicting the next text token, leaving many fine-grained acoustic cues weakly supervised or discarded.

This mismatch is especially pronounced for audio, which carries many properties that transcription alone cannot convey, including phonetic detail, speaker affect, background scene, prosody, and musical structure\. One possible solution is to have the model verbalize intermediate thinking, as in discrete text chain\-of\-thought prompting\[[54](https://arxiv.org/html/2606.18273#bib.bib15),[29](https://arxiv.org/html/2606.18273#bib.bib16),[52](https://arxiv.org/html/2606.18273#bib.bib17)\], describing what the audio contains in natural language \(for instance, “a man speaking in a high\-pitched voice” or “the first note is a C”\) before producing the answer\. However, many acoustic attributes cannot be serialized into natural language without losing fine\-grained temporal and spectral detail\. Faithful natural\-language rationales for them are rarely available at scale, and even when they are, compressing low\-level acoustic evidence into text introduces an unnecessary bottleneck\.

![Refer to caption](https://arxiv.org/html/2606.18273v1/x1.png)

Figure 1: Thinking paradigms in audio language models. (a) Vanilla audio LMs decode the response directly from audio and instruction tokens. (b) Discrete thinking generates textual thinking tokens autoregressively before the answer. (c) Continuous Audio Thinking (ours) prepends a fixed-length block of continuous thinking tokens that is consumed in a single prefill, letting the model think in an audio-aligned latent space without autoregressive cost.

These limitations motivate a new form of thinking that preserves and reorganizes acoustic information itself rather than verbalizing it. Such thinking must unfold in continuous latent positions, independent of text and free from supervision by natural-language rationales. It should also serve as a workspace in which the language model maintains, aligns, and transforms acoustic information before committing to response generation.

Toward this, we propose Continuous Audio Thinking (CoAT), an auxiliary-supervision framework that equips audio language models with a latent workspace. CoAT places a thinking block between the user input and the assistant response, where the model organizes acoustic information before generating its reply. The thinking block is grounded by distillation from diverse audio experts covering reconstruction, speech content \[[41](https://arxiv.org/html/2606.18273#bib.bib39)\], sound events \[[30](https://arxiv.org/html/2606.18273#bib.bib27)\], paralinguistic features \[[35](https://arxiv.org/html/2606.18273#bib.bib30)\], and pitch \[[3](https://arxiv.org/html/2606.18273#bib.bib32)\], providing complementary acoustic dimensions that text supervision alone cannot supply. CoAT integrates with existing LALMs without architectural changes. Furthermore, CoAT requires neither textual rationales nor task-specific decoding formats, and can be learned from a modest amount of audio data. At inference, the thinking block is consumed in a single prefill, adding no autoregressive decoding cost over the baseline, as shown in Figure [1](https://arxiv.org/html/2606.18273#S1.F1).

We instantiate CoAT on three LALMs \(Qwen2\-Audio\[[9](https://arxiv.org/html/2606.18273#bib.bib2)\], Qwen2\.5\-Omni\-7B\[[55](https://arxiv.org/html/2606.18273#bib.bib3)\], and Audio Flamingo 3\[[19](https://arxiv.org/html/2606.18273#bib.bib7)\]\) and evaluate it across diverse audio understanding and reasoning benchmarks\. CoAT yields consistent gains across backbones and outperforms text chain\-of\-thought with substantially lower per\-sample latency\. Further analysis confirms that the auxiliary supervision propagates from the thinking positions to the model’s textual responses\. We summarize our contributions as follows:

- We propose a continuous-thinking paradigm for LALMs, enabling the model to organize acoustic information in the latent space without verbalization or autoregressive decoding.
- We propose a multi-expert distillation objective that grounds the thinking states in complementary acoustic dimensions, supplying signal that text supervision alone cannot provide.
- We evaluate CoAT on three LALMs and show that it consistently improves audio understanding and reasoning at lower latency than text chain-of-thought, with analysis showing that the supervision propagates from the thinking positions to the model’s textual responses.

## 2 Related Work

##### Audio Language Models\.

A growing family of large language models has been adapted to natively process audio alongside text\[[10](https://arxiv.org/html/2606.18273#bib.bib1),[9](https://arxiv.org/html/2606.18273#bib.bib2),[55](https://arxiv.org/html/2606.18273#bib.bib3),[31](https://arxiv.org/html/2606.18273#bib.bib5),[15](https://arxiv.org/html/2606.18273#bib.bib6),[19](https://arxiv.org/html/2606.18273#bib.bib7),[49](https://arxiv.org/html/2606.18273#bib.bib4),[20](https://arxiv.org/html/2606.18273#bib.bib14),[13](https://arxiv.org/html/2606.18273#bib.bib10),[16](https://arxiv.org/html/2606.18273#bib.bib11),[47](https://arxiv.org/html/2606.18273#bib.bib12),[33](https://arxiv.org/html/2606.18273#bib.bib13),[60](https://arxiv.org/html/2606.18273#bib.bib8),[22](https://arxiv.org/html/2606.18273#bib.bib9)\]\. A common architecture connects a pretrained audio encoder, typically an audio\-text\-aligned Whisper\[[44](https://arxiv.org/html/2606.18273#bib.bib68)\]encoder, to a language model via a lightweight projection layer, exposing audio as a stream of soft\-token embeddings that are consumed by the decoder\[[49](https://arxiv.org/html/2606.18273#bib.bib4),[31](https://arxiv.org/html/2606.18273#bib.bib5),[15](https://arxiv.org/html/2606.18273#bib.bib6),[19](https://arxiv.org/html/2606.18273#bib.bib7),[20](https://arxiv.org/html/2606.18273#bib.bib14),[13](https://arxiv.org/html/2606.18273#bib.bib10),[16](https://arxiv.org/html/2606.18273#bib.bib11)\]\. Another line trains a single multimodal model end\-to\-end so that audio, vision, and text share a unified token space\[[55](https://arxiv.org/html/2606.18273#bib.bib3),[60](https://arxiv.org/html/2606.18273#bib.bib8),[22](https://arxiv.org/html/2606.18273#bib.bib9)\]\. 
In both cases, audio capabilities are typically obtained from supervised fine\-tuning on triplets of audio, instruction, and target response, sometimes structured as a multi\-stage curriculum that first aligns the audio encoder before instruction\-tuning the language model\[[19](https://arxiv.org/html/2606.18273#bib.bib7),[55](https://arxiv.org/html/2606.18273#bib.bib3)\]\.

##### Continuous and Latent Thinking\.

Chain\-of\-thought prompting\[[54](https://arxiv.org/html/2606.18273#bib.bib15),[29](https://arxiv.org/html/2606.18273#bib.bib16),[52](https://arxiv.org/html/2606.18273#bib.bib17)\]demonstrated that explicit step\-by\-step traces in language tokens improve the reasoning behavior of large language models\. Subsequent work has examined whether reasoning must remain in the discrete token space: Coconut\[[23](https://arxiv.org/html/2606.18273#bib.bib20)\]replaces textual chains with continuous embeddings that are looped back into the model, and Quiet\-STaR\[[59](https://arxiv.org/html/2606.18273#bib.bib19)\]learns implicit thoughts that are scored against the next\-token likelihood\. A related thread treats internal recurrence and depth as an explicit reasoning resource via looped transformers and implicit chain\-of\-thought formulations\[[17](https://arxiv.org/html/2606.18273#bib.bib21),[12](https://arxiv.org/html/2606.18273#bib.bib22)\]\. A concurrent line extends continuous reasoning beyond text: Chain\-of\-Visual\-Thought\[[43](https://arxiv.org/html/2606.18273#bib.bib38)\]introduces continuous visual tokens that allow vision\-language models to reason in a vision\-aligned latent space, but it still autoregressively generates additional tokens for each task and pairs them with explicit textual reasoning, leaving the inference\-cost limitations of token\-by\-token reasoning largely unaddressed\.

##### Audio Encoders for Diverse Tasks\.

A wide range of audio encoders has been developed, each capturing a different aspect of the signal. Speech understanding is supported by self-supervised speech models that learn linguistic units from raw waveforms \[[24](https://arxiv.org/html/2606.18273#bib.bib23),[2](https://arxiv.org/html/2606.18273#bib.bib24),[6](https://arxiv.org/html/2606.18273#bib.bib25),[41](https://arxiv.org/html/2606.18273#bib.bib39)\]. General sound-event analysis builds on classifiers and masked models pretrained on broad acoustic taxonomies \[[30](https://arxiv.org/html/2606.18273#bib.bib27),[7](https://arxiv.org/html/2606.18273#bib.bib26),[26](https://arxiv.org/html/2606.18273#bib.bib28),[8](https://arxiv.org/html/2606.18273#bib.bib29)\]. Music understanding is addressed by encoders specialized for tonal and rhythmic structure \[[57](https://arxiv.org/html/2606.18273#bib.bib31),[3](https://arxiv.org/html/2606.18273#bib.bib32)\], while paralinguistic analysis relies on self-supervised models trained on emotional speech \[[35](https://arxiv.org/html/2606.18273#bib.bib30)\]. Neural audio codecs provide a complementary representation, compressing waveforms into compact latent sequences from which the original signal can be decoded \[[58](https://arxiv.org/html/2606.18273#bib.bib33),[11](https://arxiv.org/html/2606.18273#bib.bib34),[62](https://arxiv.org/html/2606.18273#bib.bib35)\]. We bring these complementary encoders together and use them jointly at training time so that an audio language model can think continuously, turning a scattered set of task-specific representations into a single audio-aware reasoning substrate.

## 3 Method

![Refer to caption](https://arxiv.org/html/2606.18273v1/x2.png)

Figure 2: CoAT architecture. A continuous audio thinking block is supervised by five audio experts via per-task projection heads, covering audio feature reconstruction, speech representation, sound event detection, paralinguistic features, and pitch. The projection heads decode the shared hidden states into expert-aligned predictions, used only during training.

In this section, we propose Continuous Audio Thinking (CoAT), a method that allows LALMs to retain and organize acoustic content in latent form before generating their textual response. CoAT inserts a sequence of continuous latent positions between the audio input and the assistant response, and supervises the hidden states at those positions by distillation against multiple audio experts. The thinking block thus serves as a workspace grounded in complementary acoustic dimensions that text supervision alone cannot provide. We first formalize the audio language model interface (§[3.1](https://arxiv.org/html/2606.18273#S3.SS1)), then introduce the thinking block (§[3.2](https://arxiv.org/html/2606.18273#S3.SS2)), the expert distillation (§[3.3](https://arxiv.org/html/2606.18273#S3.SS3)), and the stage-wise training schedule (§[3.4](https://arxiv.org/html/2606.18273#S3.SS4)). The overall pipeline of the proposed method is shown in Figure [2](https://arxiv.org/html/2606.18273#S3.F2).

### 3.1 Large Audio Language Models

Before describing the method, we briefly outline how LALMs operate. We consider an audio language model composed of an LLM decoder $f_L$ and an audio encoder $f_A$. The encoder $f_A$ encodes a raw audio waveform $a$ into a sequence of $L_a$ audio token embeddings $\mathbf{x}_A \in \mathbb{R}^{L_a \times d}$ at a fixed rate $r_s$, where $d$ is the hidden size of $f_L$. The input sequence is composed of a system prompt and a user prompt, $\mathbf{x} = [\mathbf{x}^{\text{sys}}, \mathbf{x}^{\text{usr}}]$, where the user prompt itself contains the audio token embeddings followed by a text prompt, $\mathbf{x}^{\text{usr}} = [\mathbf{x}_A, \mathbf{x}_{\text{txt}}]$. The model is trained with next-token prediction using the cross-entropy loss

$$\mathcal{L}_{\text{CE}}(\mathbf{x}) = -\sum_{t \in \mathcal{I}(\mathbf{x})} \log p_{f_L}(x_t \mid x_{<t}), \tag{1}$$

where $\mathcal{I}(\mathbf{x})$ indexes positions in the assistant response.
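As a minimal sketch of the response-only cross-entropy in Eq. (1), the snippet below sums the negative log-probabilities over positions flagged as assistant response. The function name, the mask representation, and the probabilities are illustrative assumptions, not taken from the paper's code.

```python
import math

def response_cross_entropy(token_log_probs, response_mask):
    """Sum of -log p(x_t | x_<t) over assistant-response positions only."""
    return -sum(lp for lp, keep in zip(token_log_probs, response_mask) if keep)

# Hypothetical log-probabilities assigned to the ground-truth token at each position.
log_probs = [math.log(0.9), math.log(0.5), math.log(0.25), math.log(0.8)]
# The first two positions belong to the prompt; the last two are the response.
mask = [False, False, True, True]
loss = response_cross_entropy(log_probs, mask)
```

Prompt positions contribute nothing to the loss, whatever probability the model assigns there.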

### 3.2 Continuous Audio Thinking Block

The thinking block introduces dedicated capacity for the model to process acoustic information before generating its response. We extend the model vocabulary with three special tokens, $\tau_s = \texttt{<|audio\_think\_start|>}$, $\tau_p = \texttt{<|audio\_think|>}$, and $\tau_e = \texttt{<|audio\_think\_end|>}$, that serve as the boundary and content positions of an audio thinking block. Given an input that contains an audio segment of length $L_a$, the corresponding thinking block is constructed by placing one $\tau_p$ for each audio token, framed by the boundary tokens,

$$\mathbf{b}(L_a) = \big[\, \tau_s,\, \underbrace{\tau_p, \ldots, \tau_p}_{L_a},\, \tau_e \,\big]. \tag{2}$$

The block $\mathbf{b}(L_a)$ is appended to the input sequence, $\tilde{\mathbf{x}} = [\mathbf{x}^{\text{sys}}, \mathbf{x}^{\text{usr}}, \mathbf{b}(L_a)]$, and processed by the decoder $f_L$. We collect the final-layer hidden states $\mathbf{H}_{\text{think}} \in \mathbb{R}^{L_a \times d}$ at the positions $\tau_p$. At training time, the cross-entropy in Eq. ([1](https://arxiv.org/html/2606.18273#S3.E1)) is computed only at response positions; the thinking-block tokens $\tau_s, \tau_p, \tau_e$ are excluded as next-token prediction targets. During inference, the same block is appended deterministically before generation, adding prefill cost without additional autoregressive decoding.
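The block construction in Eq. (2) can be sketched at the token level as follows. The three special-token strings come from the text; the helper names and the placeholder prompt tokens are hypothetical, and a real implementation would operate on token ids or embeddings rather than strings.

```python
THINK_START = "<|audio_think_start|>"
THINK = "<|audio_think|>"
THINK_END = "<|audio_think_end|>"

def build_thinking_block(num_audio_tokens):
    """One content token per audio token, framed by the two boundary tokens."""
    return [THINK_START] + [THINK] * num_audio_tokens + [THINK_END]

def append_thinking_block(system_tokens, user_tokens, num_audio_tokens):
    """Form the extended input: system prompt, user prompt, then the block."""
    return system_tokens + user_tokens + build_thinking_block(num_audio_tokens)

def thinking_positions(sequence):
    """Indices whose final-layer hidden states would be collected as H_think."""
    return [i for i, tok in enumerate(sequence) if tok == THINK]

# Placeholder prompt: one system token, three audio tokens, three text tokens.
system = ["<sys>"]
user = ["<a0>", "<a1>", "<a2>", "What", "instrument", "?"]
sequence = append_thinking_block(system, user, num_audio_tokens=3)
positions = thinking_positions(sequence)
```

Because the block is a deterministic function of $L_a$, it can be appended before generation starts and processed together with the prompt.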

### 3.3 Distillation from Audio Experts

The thinking block provides space for processing acoustic information, but the text-only objective alone is insufficient for the model to learn to use this space effectively. We therefore distill frame-level features from audio experts to fill this gap. Let $K$ denote the set of frozen audio experts used in CoAT. For each expert $k \in K$, an encoder $\mathbf{E}_k$ produces expert features $\mathbf{z}_k = \mathbf{E}_k(a) \in \mathbb{R}^{L_k \times e_k}$ from the raw audio $a$. For each expert $k \in K$, we attach a projection head $P_k$ to the language model that projects $\mathbf{H}_{\text{think}}$ into expert-aligned predictions,

$$\hat{\mathbf{z}}_k = P_k(\mathbf{H}_{\text{think}}) \in \mathbb{R}^{L_a \times e_k}. \tag{3}$$

Each $P_k$ is implemented as a single-block Transformer (multi-head attention and a feed-forward layer) followed by a linear map to the expert embedding dimension $e_k$. The experts fall into two families: representational experts and task-specialized experts, described in turn below.

#### 3.3.1 Representational experts for location and content

Representational experts, comprising audio feature reconstruction and speech distillation, establish a foundation on which all subsequent supervision rests. Audio feature reconstruction anchors $\mathbf{H}_{\text{think}}$ to the audio encoder’s latent space, so that the thinking block occupies the same subspace as the input acoustic representation. Speech distillation then enriches that representation with the linguistic structure captured by a self-supervised speech encoder, endowing the thinking block with both location and content before any application-specific objective is introduced.

##### Audio Feature Reconstruction\.

The expert encoder is the audio encoder of the backbone itself, $\mathbf{E}_{\text{audio}} = f_A$. The thinking block is trained to reproduce the same latent representation that $f_L$ consumes for the audio, which constrains $\mathbf{H}_{\text{think}}$ to the audio-token subspace and supervises where acoustic information should be encoded before any semantic objective is introduced. We supervise this with a frame-wise MSE, $\mathcal{L}_{\text{recon}} = \mathrm{MSE}(\hat{\mathbf{z}}_{\text{audio}}, \mathbf{z}_{\text{audio}})$.
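A frame-wise MSE of this kind can be sketched as below. The text does not state the reduction, so averaging over all frames and feature dimensions is an assumption here, and the feature values are made up.

```python
def frame_mse(pred, target):
    """Mean squared error over all frames and feature dimensions (assumed reduction)."""
    assert len(pred) == len(target), "prediction and target need the same number of frames"
    total, count = 0.0, 0
    for p_frame, t_frame in zip(pred, target):
        assert len(p_frame) == len(t_frame), "feature dimensions must match"
        for p, t in zip(p_frame, t_frame):
            total += (p - t) ** 2
            count += 1
    return total / count

# Two frames with two feature dimensions each (toy values).
predicted = [[1.0, 2.0], [3.0, 4.0]]
expert = [[1.0, 0.0], [3.0, 2.0]]
recon_loss = frame_mse(predicted, expert)
```

The same function form applies wherever the text specifies a frame-wise MSE, with only the expert features changing.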

##### Speech Representation Distillation\.

Furthermore, we employ $\mathbf{E}_{\text{SPIDR}}$ \[[41](https://arxiv.org/html/2606.18273#bib.bib39)\], a self-supervised speech encoder trained without labels to extract stable linguistic units from raw waveforms. We use its encoder output as the expert feature, supervised with a frame-wise MSE, $\mathcal{L}_{\text{speech}} = \mathrm{MSE}(\hat{\mathbf{z}}_{\text{SPIDR}}, \mathbf{z}_{\text{SPIDR}})$. This task aligns the thinking representation with phonetic and lexical content, complementing the surface-level signal from audio feature reconstruction.

#### 3.3.2 Task-specialized experts for application-domain capabilities

Building on this foundation, we attach experts that supply task\-specialized capabilities\. Each adds a capability that is otherwise difficult to acquire under text supervision alone, namely sound event detection for environmental audio, paralinguistic features for vocal affect, and pitch prediction for harmonic and prosodic structure\. Importantly, this design is not specific to the three experts used here\. Any audio encoder that captures a desired representation can be incorporated as an additional task\.

##### Sound Event Detection\.

For sound-event semantics, we use the expert encoder $\mathbf{E}_{\text{PANNs}}$ \[[30](https://arxiv.org/html/2606.18273#bib.bib27)\], a CNN-based audio tagger pre-trained on AudioSet that produces frame-wise activations over 527 sound-event classes spanning speech, music, animal, vehicle, and ambient categories. We map the student prediction into the same class-logit space by passing it through PANNs’s final classification head $g_{\mathrm{cls}}$, and match it to the expert’s per-frame, per-class probabilities with binary cross-entropy, $\mathcal{L}_{\text{sed}} = \mathrm{BCE}(g_{\mathrm{cls}}(\hat{\mathbf{z}}_{\text{PANNs}}), g_{\mathrm{cls}}(\mathbf{z}_{\text{PANNs}}))$. This task exposes the thinking representation to a broad taxonomy of sound-event semantics, supporting general audio understanding tasks beyond speech.
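The idea of passing both student and expert features through one shared classification head and matching them with binary cross-entropy can be sketched as follows. The tiny linear-plus-sigmoid head is only a stand-in for $g_{\mathrm{cls}}$; its weights, the two-class setup, and all feature values are assumptions for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def classification_head(features, weight, bias):
    """Stand-in for g_cls: a linear layer giving one logit per class per frame."""
    return [[sum(w * f for w, f in zip(row, frame)) + b for row, b in zip(weight, bias)]
            for frame in features]

def soft_target_bce(student_logits, teacher_logits, eps=1e-7):
    """Binary cross-entropy against the teacher's per-frame, per-class probabilities."""
    total, count = 0.0, 0
    for s_frame, t_frame in zip(student_logits, teacher_logits):
        for s, t in zip(s_frame, t_frame):
            p = min(max(sigmoid(s), eps), 1.0 - eps)  # student probability
            q = sigmoid(t)                            # teacher probability (soft label)
            total -= q * math.log(p) + (1.0 - q) * math.log(1.0 - p)
            count += 1
    return total / count

# Toy head with 2 feature dimensions and 2 classes (the real tagger has 527 classes).
weight = [[1.0, -1.0], [0.5, 0.5]]
bias = [0.0, 0.0]
expert_feat = [[1.0, 0.0]]
student_feat = [[0.0, 1.0]]
teacher = classification_head(expert_feat, weight, bias)
loss_mismatch = soft_target_bce(classification_head(student_feat, weight, bias), teacher)
loss_match = soft_target_bce(teacher, teacher)
```

With soft targets the loss does not reach zero even for a perfect match; it bottoms out at the entropy of the teacher's probabilities, so only differences between loss values are informative.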

##### Paralinguistic Feature Prediction\.

Voice affect is captured by $\mathbf{E}_{\text{emotion2vec}}$ \[[35](https://arxiv.org/html/2606.18273#bib.bib30)\], a self-supervised model trained on speech emotion data that produces frame-wise representations. We use its hidden-state output as the expert feature, supervised with a frame-wise MSE, $\mathcal{L}_{\text{emo}} = \mathrm{MSE}(\hat{\mathbf{z}}_{\text{emotion2vec}}, \mathbf{z}_{\text{emotion2vec}})$. This task adds a non-lexical channel to the thinking representation that captures how the audio is spoken, such as affect, prosody, and intensity, rather than what is said.

##### Pitch Prediction\.

For pitch and harmonic structure, we adopt $\mathbf{E}_{\text{basic-pitch}}$ \[[3](https://arxiv.org/html/2606.18273#bib.bib32)\], a polyphonic pitch detector pre-trained on instrument transcription with a multi-pitch posteriorgram output. We use its intermediate convolutional activations as the expert feature; the dense intermediate representation preserves harmonic information while avoiding the sparsity of the final posteriorgram. The pitch task combines two losses, an MSE term on the intermediate feature and an auxiliary focal-BCE on the posteriorgram obtained by passing both student and expert features through the encoder’s final convolution $h$. The loss is given by

$$\mathcal{L}_{\text{pitch}} = \mathrm{MSE}(\hat{\mathbf{z}}_{\text{basic-pitch}}, \mathbf{z}_{\text{basic-pitch}}) + w_{\text{pitch}}^{\text{aux}}\, \mathrm{focal\text{-}BCE}(h(\hat{\mathbf{z}}_{\text{basic-pitch}}), h(\mathbf{z}_{\text{basic-pitch}})).$$

Pure BCE on the highly sparse posteriorgram collapses to all-zero predictions, while pure MSE on the dense intermediate feature under-constrains the pitch contour. This task introduces fine-grained pitch information into the thinking representation, which is otherwise sparsely covered by the speech- and sound-centric experts.
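The two-term pitch loss can be sketched as below. The exact focal-BCE form is not given in this section, so the sketch assumes one common soft-target variant that scales each BCE term by $|q - p|^{\gamma}$. The value of `gamma`, the weight `w_aux`, and the element-wise sigmoid standing in for the final convolution $h$ are all hypothetical.

```python
import math

def mse(pred, target):
    """Mean squared error over all frames and feature dimensions."""
    sq = [(p - t) ** 2 for pf, tf in zip(pred, target) for p, t in zip(pf, tf)]
    return sum(sq) / len(sq)

def focal_bce(student_probs, teacher_probs, gamma=2.0, eps=1e-7):
    """Soft-target BCE, down-weighted where the student already matches (assumed form)."""
    terms = []
    for sf, tf in zip(student_probs, teacher_probs):
        for p, q in zip(sf, tf):
            p = min(max(p, eps), 1.0 - eps)
            bce = -(q * math.log(p) + (1.0 - q) * math.log(1.0 - p))
            terms.append(abs(q - p) ** gamma * bce)
    return sum(terms) / len(terms)

def pitch_loss(student_feat, expert_feat, final_conv, w_aux=0.5):
    """MSE on dense features plus a weighted focal-BCE on the posteriorgram."""
    dense = mse(student_feat, expert_feat)
    sparse = focal_bce(final_conv(student_feat), final_conv(expert_feat))
    return dense + w_aux * sparse

def toy_final_conv(features):
    """Hypothetical stand-in for h: an element-wise sigmoid giving probabilities."""
    return [[1.0 / (1.0 + math.exp(-v)) for v in frame] for frame in features]

expert_feat = [[2.0, -3.0, -3.0], [-3.0, 2.0, -3.0]]
student_feat = [[0.0, -1.0, -3.0], [-3.0, 0.5, -2.0]]
loss_value = pitch_loss(student_feat, expert_feat, toy_final_conv)
loss_perfect = pitch_loss(expert_feat, expert_feat, toy_final_conv)
```

In this variant both terms vanish when the student reproduces the expert features exactly, and the focal factor concentrates the auxiliary term on bins where the two posteriorgrams disagree.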

### 3.4 Training Objective

##### Stage\-Wise Training\.

Training is organized as a sequence of $S$ stages. Each stage is defined by its data, step budget, learning rate, and an active task subset $\mathcal{A}_s \subseteq K$ with per-task weights $w_k^{(s)}$, where $s$ denotes the stage index. Each stage applies only the expert distillation tasks in $\mathcal{A}_s$ in addition to the language-modeling cross-entropy. The parameter scope spans LoRA \[[25](https://arxiv.org/html/2606.18273#bib.bib67)\] adapters, projection heads, added tokens, and the embedding layer. We instantiate $S = 2$ in our main experiments, with a warm-up stage $\mathcal{A}_1$ using audio feature reconstruction alone, followed by a multi-task stage with $\mathcal{A}_2$ spanning all experts.

##### Overall Objective\.

The total loss at stage $s$ is the sum of the language-modeling and distillation losses

$$\mathcal{L}^{(s)}_{\text{total}}(\tilde{\mathbf{x}}) = \mathcal{L}_{\text{CE}}(\tilde{\mathbf{x}}) + \sum_{k \in \mathcal{A}_s} w_k^{(s)}\, \mathcal{L}_k, \tag{4}$$

with $\mathcal{L}_{\text{CE}}$ computed on $\tilde{\mathbf{x}}$ under the thinking-token mask. The thinking block thus contributes purely through the distillation losses, leaving the model’s text generation unchanged.
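The two-stage schedule and the objective in Eq. (4) can be sketched together as follows. The task keys mirror the loss subscripts in the text. The weights of 1.0 and the per-task loss values are placeholders, since this section does not list the values actually used.

```python
EXPERTS = ("recon", "speech", "sed", "emo", "pitch")

# Active task subsets A_s with per-task weights w_k^(s); the 1.0 weights are placeholders.
STAGES = [
    {"name": "warmup", "weights": {"recon": 1.0}},
    {"name": "multitask", "weights": {k: 1.0 for k in EXPERTS}},
]

def total_loss(ce_loss, distill_losses, stage):
    """Cross-entropy plus the weighted distillation losses of the stage's active tasks."""
    return ce_loss + sum(w * distill_losses[k] for k, w in stage["weights"].items())

# Hypothetical loss values for one batch.
losses = {"recon": 0.5, "speech": 0.2, "sed": 0.1, "emo": 0.3, "pitch": 0.4}
stage1_total = total_loss(2.0, losses, STAGES[0])
stage2_total = total_loss(2.0, losses, STAGES[1])
```

Tasks outside the active subset are simply absent from the stage's weight table, so they add nothing to the sum during the warm-up stage.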

## 4 Experiments

### 4.1 Experimental Setup

We instantiate CoAT on three pretrained audio language models: Qwen2\-Audio\[[9](https://arxiv.org/html/2606.18273#bib.bib2)\], Qwen2\.5\-Omni\-7B\[[55](https://arxiv.org/html/2606.18273#bib.bib3)\], and Audio Flamingo 3\[[19](https://arxiv.org/html/2606.18273#bib.bib7)\]\. All backbones are trained with the same two\-stage CoAT schedule: a reconstruction\-only warm\-up stage followed by a multi\-task stage with all five experts active\.

We use the same public training mixture, sampling policy, and evaluation protocol across all backbones and benchmarks\. The training mixture covers automatic speech recognition\[[40](https://arxiv.org/html/2606.18273#bib.bib40),[5](https://arxiv.org/html/2606.18273#bib.bib41),[1](https://arxiv.org/html/2606.18273#bib.bib47),[50](https://arxiv.org/html/2606.18273#bib.bib46),[18](https://arxiv.org/html/2606.18273#bib.bib42),[39](https://arxiv.org/html/2606.18273#bib.bib44)\], audio and speech question answering\[[28](https://arxiv.org/html/2606.18273#bib.bib56),[4](https://arxiv.org/html/2606.18273#bib.bib55),[63](https://arxiv.org/html/2606.18273#bib.bib62),[32](https://arxiv.org/html/2606.18273#bib.bib59)\], audio captioning\[[28](https://arxiv.org/html/2606.18273#bib.bib56),[14](https://arxiv.org/html/2606.18273#bib.bib58)\], multiple\-choice question audio understanding\[[42](https://arxiv.org/html/2606.18273#bib.bib54),[4](https://arxiv.org/html/2606.18273#bib.bib55),[63](https://arxiv.org/html/2606.18273#bib.bib62)\], music understanding\[[38](https://arxiv.org/html/2606.18273#bib.bib61)\], spoken\-instruction following\[[46](https://arxiv.org/html/2606.18273#bib.bib69)\], and a small text\-only supervised fine\-tuning split\[[27](https://arxiv.org/html/2606.18273#bib.bib60)\]\. Further experimental details are provided in Appendix[A](https://arxiv.org/html/2606.18273#A1)\.

### 4.2 Main Results

Table 1: Main results. Per-benchmark performance of Qwen2-Audio, Qwen2.5-Omni-7B, and Audio Flamingo 3 with and without CoAT.

Across all three backbones, CoAT improves the pretrained baseline on the majority of benchmarks, as reported in Table [1](https://arxiv.org/html/2606.18273#S4.T1), demonstrating its generality across various models and tasks. The improvements are most pronounced on understanding- and reasoning-intensive tasks such as MELD, MMAR, and Alpaca-Audio, indicating that CoAT is particularly effective on audio understanding and reasoning. On automatic speech recognition, CoAT substantially improves the weaker Qwen2-Audio backbone while preserving performance on Qwen2.5-Omni and Audio Flamingo 3. Overall, CoAT generalizes across heterogeneous audio-language backbones and yields the largest gains on reasoning-heavy audio understanding.

### 4.3 Comparison with Discrete Thinking

We compare CoAT’s continuous thinking against the discrete-thinking alternative, in which the model autoregressively generates a textual reasoning trace before producing the answer. For Audio Flamingo 3, which has a native think mode, we use its built-in chain-of-thought template. For Qwen2.5-Omni, which has no native think mode, we instead induce text-CoT through a prompt-level instruction asking the model to reason step by step.

Table [2](https://arxiv.org/html/2606.18273#S4.T2) reports reasoning accuracy on two benchmarks (MMAU and MMAR) with four inference-cost metrics: time to first token (TTFT), decoding time (Dec. time), the number of decoded tokens (Dec. tok), and end-to-end latency (Total). Each reported value is the mean over multiple runs measured in the same environment; full experimental details are provided in Appendix [C](https://arxiv.org/html/2606.18273#A3).

Table 2: Comparison with discrete reasoning. Reasoning accuracy and per-sample inference cost on the full MMAU and MMAR evaluation sets.

Text-CoT and CoAT differ in where they spend inference cost. Text-CoT autoregressively decodes a reasoning chain before the answer, so its overhead lands almost entirely in the decode stage and does not amortize across batched requests. CoAT instead consumes its thinking block in a single prefill, shifting the overhead out of the decoding phase. As a result, CoAT runs much faster than text-CoT while improving reasoning accuracy over the same-backbone baseline. Appendix [D](https://arxiv.org/html/2606.18273#A4) provides a full per-benchmark comparison with Audio Flamingo 3’s native think mode.
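To make the prefill-versus-decode distinction concrete, the sketch below uses a deliberately crude linear cost model. Every throughput and token count in it is invented for illustration and is not a measurement from the paper; real prefill cost is not exactly linear in sequence length.

```python
def latency(prefill_tokens, decoded_tokens, prefill_tok_per_s, decode_tok_per_s):
    """Return (TTFT, decode time, total) in seconds under a linear cost model."""
    ttft = prefill_tokens / prefill_tok_per_s
    decode_time = decoded_tokens / decode_tok_per_s
    return ttft, decode_time, ttft + decode_time

# Made-up numbers: prefill handles tokens in parallel, decoding emits one at a time.
PREFILL_TOK_PER_S = 5000.0
DECODE_TOK_PER_S = 50.0
prompt_tokens = 200      # system + user prompt, including 150 audio tokens
audio_tokens = 150
answer_tokens = 10
rationale_tokens = 200   # length of a hypothetical text chain-of-thought

baseline = latency(prompt_tokens, answer_tokens, PREFILL_TOK_PER_S, DECODE_TOK_PER_S)
text_cot = latency(prompt_tokens, rationale_tokens + answer_tokens,
                   PREFILL_TOK_PER_S, DECODE_TOK_PER_S)
# The thinking block adds one token per audio token plus two boundary tokens to the prefill.
coat = latency(prompt_tokens + audio_tokens + 2, answer_tokens,
               PREFILL_TOK_PER_S, DECODE_TOK_PER_S)
```

Under these assumed rates the thinking block lengthens only the time to first token, while the rationale lengthens the much slower decode phase.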

### 4.4 Analysis

![Refer to caption](https://arxiv.org/html/2606.18273v1/x3.png)

Figure 3: Linear probe accuracy at the audio\-think hidden state across training checkpoints, on 4\-class IEMOCAP emotion and 12\-class MuchoMusic dominant pitch\.

We probe whether CoAT’s auxiliary supervision injects task\-relevant information at thinking positions by training linear probes on the LM hidden state at two positions\. The first is the audio\-think hidden state, taken as the mean over the thinking block $\tau_p$ where CoAT’s supervision attaches\. The second is the pre\-generation hidden state, taken from the last $\tau_e$ token for CoAT and from the last input token before the assistant turn for the vanilla baseline\. Both correspond to the model’s decision\-time representation\.

##### Linear probing accuracy across training checkpoints\.

We probe two targets aligned with the training supervision, specifically 4\-class emotion on IEMOCAP and 12\-class dominant pitch on MuchoMusic\. Linear probes are trained on the audio\-think hidden state of the full 5\-head CoAT across training checkpoints\. Figure [3](https://arxiv.org/html/2606.18273#S4.F3) shows that probe accuracy rises in stage 2 on both targets, when specialized experts begin contributing supervision beyond reconstruction\. This indicates that the auxiliary supervision injects task\-relevant information into the supervised position\.
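
The probing recipe described here (mean\-pool the thinking\-block states, then fit a linear classifier) can be sketched as follows. This is a minimal pure\-Python illustration rather than the paper's probing code. The hidden states are synthetic Gaussian stand\-ins, and the dimensions, block length, learning rate, and all function names are assumptions.

```python
import math
import random


def mean_pool(states):
    """Average a list of per-position hidden vectors into one vector."""
    dim = len(states[0])
    return [sum(s[i] for s in states) / len(states) for i in range(dim)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def logits_for(W, b, x):
    return [sum(w * xi for w, xi in zip(W[c], x)) + b[c] for c in range(len(W))]


def train_linear_probe(features, labels, num_classes, epochs=100, lr=0.5):
    """Fit a softmax-regression probe with full-batch gradient descent."""
    dim, n = len(features[0]), len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        gW = [[0.0] * dim for _ in range(num_classes)]
        gb = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits_for(W, b, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                gb[c] += err
                for i in range(dim):
                    gW[c][i] += err * x[i]
        for c in range(num_classes):
            b[c] -= lr * gb[c] / n
            for i in range(dim):
                W[c][i] -= lr * gW[c][i] / n
    return W, b


def predict(W, b, x):
    scores = logits_for(W, b, x)
    return max(range(len(scores)), key=lambda c: scores[c])


random.seed(0)
NUM_CLASSES, DIM, BLOCK_LEN = 4, 8, 4  # hypothetical sizes, far below real LM widths


def synthetic_block(label):
    """Synthetic stand-in for the thinking-block hidden states of one utterance."""
    return [[(3.0 if i == label else 0.0) + random.gauss(0.0, 1.0) for i in range(DIM)]
            for _ in range(BLOCK_LEN)]


def make_split(per_class):
    feats, labels = [], []
    for label in range(NUM_CLASSES):
        for _ in range(per_class):
            feats.append(mean_pool(synthetic_block(label)))
            labels.append(label)
    return feats, labels


train_x, train_y = make_split(40)
test_x, test_y = make_split(20)
W, b = train_linear_probe(train_x, train_y, NUM_CLASSES)
accuracy = sum(predict(W, b, x) == y for x, y in zip(test_x, test_y)) / len(test_y)
print(f"probe accuracy on held-out synthetic data: {accuracy:.2f}")
```

Because the probe is linear, high held\-out accuracy indicates that the target is linearly decodable from the pooled state. Repeating the fit at each checkpoint would give a curve of the kind the figure reports.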

##### Correlation between thinking\-token information and downstream performance\.

Table 3: Probe accuracy and Spearman $\rho$ between probe confidence and downstream task performance, reported as \(accuracy / $\rho$\)\. Think and Pre\-gen denote the audio\-think and pre\-generation hidden states, respectively\.

We now probe two targets aligned with the downstream task, namely 4\-class emotion on IEMOCAP and 7\-class instrument family on the MuchoMusic instrument\-question subset\. We compare three models: the vanilla Qwen2\.5\-Omni baseline, a CoAT control trained only with representational experts, and the CoAT trained with all five experts\. Table [3](https://arxiv.org/html/2606.18273#S4.T3) reports probe accuracy alongside within\-model Spearman correlations between probe confidence and downstream task performance\. The CoAT trained with all five experts attains the highest probe accuracy and the strongest within\-model correlation across all cases\. Together, these results indicate that the supervision injects information at the thinking position that accumulates over training and correlates with downstream task performance\.
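
As an illustration of the correlation statistic, the sketch below computes Spearman's $\rho$ as the Pearson correlation of average ranks, in pure Python. The per\-sample values are invented, and treating downstream performance as a 0/1 correctness flag per sample is an assumption, not something this section specifies.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman_rho(x, y):
    """Spearman correlation: Pearson correlation computed on average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample values: probe confidence in the gold class and
# whether the model's downstream answer was correct (1) or not (0).
probe_confidence = [0.91, 0.35, 0.78, 0.52, 0.66, 0.20]
answer_correct = [1, 0, 1, 0, 1, 0]

rho = spearman_rho(probe_confidence, answer_correct)
print(f"Spearman rho = {rho:.3f}")
```

A positive value means that samples on which the probe is more confident tend to be the ones the model answers correctly, which is the within\-model relationship the table summarizes.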

##### Visualizing per\-head reconstructions\.

To verify that the thinking representation distills information from each expert, we visualize example reconstructions in Figure [4](https://arxiv.org/html/2606.18273#S4.F4)\. For audio feature reconstruction, we use the Sim\-Whisper \[[62](https://arxiv.org/html/2606.18273#bib.bib35)\] codec decoder to reconstruct audio from the predicted feature, which we visualize as a mel\-spectrogram\. For the other tasks, we pass the predicted feature through the expert’s prediction head and visualize the resulting output\. Each task’s reconstruction faithfully matches the expert target, confirming that the thinking representation encodes the information required to perform every supervised task\.

![Refer to caption](https://arxiv.org/html/2606.18273v1/x4.png) \(a\) Audio feature reconstruction
![Refer to caption](https://arxiv.org/html/2606.18273v1/x5.png) \(b\) Sound event detection
![Refer to caption](https://arxiv.org/html/2606.18273v1/x6.png) \(c\) Paralinguistic feature prediction
![Refer to caption](https://arxiv.org/html/2606.18273v1/x7.png) \(d\) Pitch prediction

Figure 4: Example reconstructions from CoAT’s per\-task heads\. Each pair shows the expert target \(right\) and the corresponding student prediction \(left\) at the audio\-think positions\.

### 4\.5Ablation Studies

We conduct ablations on a single backbone, Qwen2\.5\-Omni, to validate our design choices\.

##### Task Ablation\.

Table [4](https://arxiv.org/html/2606.18273#S4.T4) reports the cumulative contribution of each component\. Supervised fine\-tuning on top of Qwen2\.5\-Omni substantially improves Emotion and reduces ASR error, but leaves General and AIR\-Bench essentially flat and regresses Music below the baseline\. Adding the continuous thinking block improves General, Emotion, and ASR, and partially recovers Music, though AIR\-Bench and Music remain below the original Qwen2\.5\-Omni baseline\. Representational expert distillation lifts AIR\-Bench substantially, further recovers Music, and raises General\. Specialized expert distillation attains the best score on every metric, including the only Music value that surpasses the original baseline\. Relative to SFT alone, CoAT improves every metric and is the only configuration that lifts both broad reasoning and music understanding above the original Qwen2\.5\-Omni baseline\. These results show that the improvements are not driven by SFT alone: representational expert distillation accounts for the gains in general understanding and reasoning, while specialized expert distillation yields the largest improvements on task\-relevant metrics\.

Table 4: Task ablation\. Starting from the baseline model, we cumulatively add one component at a time and retrain under the same schedule: supervised fine\-tuning \(SFT\), continuous thinking block, representational expert distillation, and specialized expert distillation\. We report the average performance of benchmarks with the same metric\.

Table 5: Projector type\. We compare a linear projection against the single\-block Transformer projector used in the main results\. We report the average performance of benchmarks with the same metric\.

##### Projector Type\.

The choice of projector affects how the thinking representation is decoded into expert\-aligned predictions\. We compare a linear projector against the single\-block Transformer projector used in our main results\. Table [5](https://arxiv.org/html/2606.18273#S4.T5) shows that the Transformer projector generally outperforms the simple linear projection layer\.

## 5Conclusion

We introduce Continuous Audio Thinking \(CoAT\), a framework that equips audio language models with a continuous latent workspace for organizing acoustic information before response generation\. Audio experts supervise this workspace through distillation under a representational\-then\-specialized schedule: latent states are first anchored to the audio space and then aligned with semantic, paralinguistic, and musical structure\. CoAT consistently improves audio understanding and reasoning across three LALMs, while running at lower per\-sample latency than text chain\-of\-thought\. Further analyses show that the auxiliary signal propagates from the thinking positions to the textual outputs: linear probes on those positions become more accurate over training, and within\-model probe confidence correlates with downstream task performance\. CoAT shows that continuous latent thinking can support reasoning in modalities that are difficult to verbalize\.

## Limitations and Future Work

Our study has two main limitations, which we leave to future work\. First, the thinking block in CoAT is deterministic, occupying a fixed span at a pre\-defined position between the audio input and the assistant response\. The model does not learn when or how long to think, nor does it interleave thinking with response generation\. CoAT thus realizes a latent workspace but not multi\-step latent reasoning, and extending it to dynamic or interleaved thinking blocks is an important next step\. Second, our empirical validation is confined to the audio domain\. Although the proposed mechanism is modality\-agnostic in principle, whether the same recipe transfers to vision\- and video\-language models remains an open question\.

## References

- \[1\]R\. Ardila, M\. Branson, K\. Davis, M\. Henretty, M\. Kohler, J\. Meyer, R\. Morais, L\. Saunders, F\. M\. Tyers, and G\. Weber\(2020\)Common voice: a massively\-multilingual speech corpus\.External Links:1912\.06670,[Link](https://arxiv.org/abs/1912.06670)Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[2\]A\. Baevski, Y\. Zhou, A\. Mohamed, and M\. Auli\(2020\)Wav2vec 2\.0: a framework for self\-supervised learning of speech representations\.Advances in neural information processing systems33,pp\. 12449–12460\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[3\]R\. M\. Bittner, J\. J\. Bosch, D\. Rubinstein, G\. Meseguer\-Brocal, and S\. Ewert\(2022\)A lightweight instrument\-agnostic model for polyphonic note transcription and multipitch estimation\.InProceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing \(ICASSP\),Singapore\.Cited by:[Table A](https://arxiv.org/html/2606.18273#A1.T1.6.7.5.1),[§1](https://arxiv.org/html/2606.18273#S1.p4.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1),[§3\.3\.2](https://arxiv.org/html/2606.18273#S3.SS3.SSS2.Px3.p1.3)\.
- \[4\]C\. Busso, M\. Bulut, C\. Lee, A\. Kazemzadeh, E\. Mower, S\. Kim, J\. N\. Chang, S\. Lee, and S\. S\. Narayanan\(2008\)IEMOCAP: interactive emotional dyadic motion capture database\.Language resources and evaluation42\(4\),pp\. 335–359\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[5\]G\. Chen, S\. Chai, G\. Wang, J\. Du, W\. Zhang, C\. Weng, D\. Su, D\. Povey, J\. Trmal, J\. Zhang,et al\.\(2021\)Gigaspeech: an evolving, multi\-domain asr corpus with 10,000 hours of transcribed audio\.arXiv preprint arXiv:2106\.06909\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[6\]S\. Chen, C\. Wang, Z\. Chen, Y\. Wu, S\. Liu, Z\. Chen, J\. Li, N\. Kanda, T\. Yoshioka, X\. Xiao,et al\.\(2022\)Wavlm: large\-scale self\-supervised pre\-training for full stack speech processing\.IEEE Journal of Selected Topics in Signal Processing16\(6\),pp\. 1505–1518\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[7\]S\. Chen, Y\. Wu, C\. Wang, S\. Liu, D\. Tompkins, Z\. Chen, W\. Che, X\. Yu, and F\. Wei\(2023\)BEATs: audio pre\-training with acoustic tokenizers\.InInternational Conference on Machine Learning,pp\. 5178–5193\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[8\]W\. Chen, Y\. Liang, Z\. Ma, Z\. Zheng, and X\. Chen\(2024\)EAT: self\-supervised pre\-training with efficient audio transformer\.InProceedings of the Thirty\-Third International Joint Conference on Artificial Intelligence,pp\. 3807–3815\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[9\]Y\. Chu, J\. Xu, Q\. Yang, H\. Wei, X\. Wei, Z\. Guo, Y\. Leng, Y\. Lv, J\. He, J\. Lin,et al\.\(2024\)Qwen2\-audio technical report\.arXiv preprint arXiv:2407\.10759\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p1.1),[§1](https://arxiv.org/html/2606.18273#S1.p5.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p1.1)\.
- \[10\]Y\. Chu, J\. Xu, X\. Zhou, Q\. Yang, S\. Zhang, Z\. Yan, C\. Zhou, and J\. Zhou\(2023\)Qwen\-audio: advancing universal audio understanding via unified large\-scale audio\-language models\.arXiv preprint arXiv:2311\.07919\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[11\]A\. Défossez, J\. Copet, G\. Synnaeve, and Y\. Adi\(2022\)High fidelity neural audio compression\.arXiv preprint arXiv:2210\.13438\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[12\]Y\. Deng, Y\. Choi, and S\. Shieber\(2024\)From explicit cot to implicit cot: learning to internalize cot step by step\.arXiv preprint arXiv:2405\.14838\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[13\]S\. Deshmukh, B\. Elizalde, R\. Singh, and H\. Wang\(2023\)Pengi: an audio language model for audio tasks\.Advances in Neural Information Processing Systems36,pp\. 18090–18108\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[14\]K\. Drossos, S\. Lipping, and T\. Virtanen\(2020\)Clotho: an audio captioning dataset\.InICASSP 2020\-2020 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 736–740\.Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[15\]S\. Ghosh, Z\. Kong, S\. Kumar, S\. Sakshi, J\. Kim, W\. Ping, R\. Valle, D\. Manocha, and B\. Catanzaro\(2025\)Audio flamingo 2: an audio\-language model with long\-audio understanding and expert reasoning abilities\.arXiv preprint arXiv:2503\.03983\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[16\]S\. Ghosh, S\. Kumar, A\. Seth, C\. K\. R\. Evuru, U\. Tyagi, S\. Sakshi, O\. Nieto, R\. Duraiswami, and D\. Manocha\(2024\)Gama: a large audio\-language model with advanced audio understanding and complex reasoning abilities\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 6288–6313\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[17\]A\. Giannou, S\. Rajput, J\. Sohn, K\. Lee, J\. D\. Lee, and D\. Papailiopoulos\(2023\)Looped transformers as programmable computers\.InInternational Conference on Machine Learning,pp\. 11398–11442\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[18\]J\. J\. Godfrey, E\. C\. Holliman, and J\. McDaniel\(1992\)SWITCHBOARD: telephone speech corpus for research and development\.In\[Proceedings\] ICASSP\-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing,Vol\.1,pp\. 517–520\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[19\]A\. Goel, S\. Ghosh, J\. Kim, S\. Kumar, Z\. Kong, S\. Lee, C\. H\. Yang, R\. Duraiswami, D\. Manocha, R\. Valle,et al\.\(2025\)Audio flamingo 3: advancing audio intelligence with fully open large audio language models\.arXiv preprint arXiv:2507\.08128\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p1.1),[§1](https://arxiv.org/html/2606.18273#S1.p5.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p1.1)\.
- \[20\]Y\. Gong, H\. Luo, A\. H\. Liu, L\. Karlinsky, and J\. Glass\(2023\)Listen, think, and understand\.arXiv preprint arXiv:2305\.10790\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[21\]Y\. Gong, J\. Yu, and J\. Glass\(2022\)Vocalsound: a dataset for improving human vocal sounds recognition\.InICASSP 2022\-2022 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 151–155\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[22\]J\. Han, K\. Gong, Y\. Zhang, J\. Wang, K\. Zhang, D\. Lin, Y\. Qiao, P\. Gao, and X\. Yue\(2024\)Onellm: one framework to align all modalities with language\.InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition,pp\. 26584–26595\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[23\]S\. Hao, S\. Sukhbaatar, D\. Su, X\. Li, Z\. Hu, J\. Weston, and Y\. Tian\(2024\)Training large language models to reason in a continuous latent space\.arXiv preprint arXiv:2412\.06769\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[24\]W\. Hsu, B\. Bolte, Y\. H\. Tsai, K\. Lakhotia, R\. Salakhutdinov, and A\. Mohamed\(2021\)Hubert: self\-supervised speech representation learning by masked prediction of hidden units\.IEEE/ACM transactions on audio, speech, and language processing29,pp\. 3451–3460\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[25\]E\. J\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen\(2022\)LoRA: low\-rank adaptation of large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=nZeVKeeFYf9)Cited by:[§3\.4](https://arxiv.org/html/2606.18273#S3.SS4.SSS0.Px1.p1.8)\.
- \[26\]P\. Huang, H\. Xu, J\. Li, A\. Baevski, M\. Auli, W\. Galuba, F\. Metze, and C\. Feichtenhofer\(2022\)Masked autoencoders that listen\.Advances in neural information processing systems35,pp\. 28708–28720\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[27\]L\. Jiang, K\. Rao, S\. Han, A\. Ettinger, F\. Brahman, S\. Kumar, N\. Mireshghallah, X\. Lu, M\. Sap, Y\. Choi, and N\. Dziri\(2024\)WildTeaming at scale: from in\-the\-wild jailbreaks to \(adversarially\) safer language models\.External Links:2406\.18510,[Link](https://arxiv.org/abs/2406.18510)Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[28\]C\. D\. Kim, B\. Kim, H\. Lee, and G\. Kim\(2019\)Audiocaps: generating captions for audios in the wild\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),pp\. 119–132\.Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[29\]T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa\(2022\)Large language models are zero\-shot reasoners\.Advances in neural information processing systems35,pp\. 22199–22213\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p2.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[30\]Q\. Kong, Y\. Cao, T\. Iqbal, Y\. Wang, W\. Wang, and M\. D\. Plumbley\(2020\)Panns: large\-scale pretrained audio neural networks for audio pattern recognition\.IEEE/ACM Transactions on Audio, Speech, and Language Processing28,pp\. 2880–2894\.Cited by:[Table A](https://arxiv.org/html/2606.18273#A1.T1.6.5.3.1),[§1](https://arxiv.org/html/2606.18273#S1.p4.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1),[§3\.3\.2](https://arxiv.org/html/2606.18273#S3.SS3.SSS2.Px1.p1.3)\.
- \[31\]Z\. Kong, A\. Goel, R\. Badlani, W\. Ping, R\. Valle, and B\. Catanzaro\(2024\)Audio flamingo: a novel audio language model with few\-shot learning and dialogue abilities\.arXiv preprint arXiv:2402\.01831\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[32\]S\. Lipping, P\. Sudarsanam, K\. Drossos, and T\. Virtanen\(2022\)Clotho\-aqa: a crowdsourced dataset for audio question answering\.In2022 30th European Signal Processing Conference \(EUSIPCO\),pp\. 1140–1144\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[33\]S\. Liu, A\. S\. Hussain, C\. Sun, and Y\. Shan\(2024\)Music understanding llama: advancing text\-to\-music generation with question answering and captioning\.InICASSP 2024\-2024 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 286–290\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[34\]Z\. Ma, Y\. Ma, Y\. Zhu, C\. Yang, Y\. Chao, R\. Xu,et al\.\(2025\)MMAR: a challenging benchmark for deep reasoning in speech, audio, music, and their mix\.arXiv preprint arXiv:2505\.13032\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[35\]Z\. Ma, Z\. Zheng, J\. Ye, J\. Li, Z\. Gao, S\. Zhang, and X\. Chen\(2024\)Emotion2vec: self\-supervised pre\-training for speech emotion representation\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 15747–15760\.Cited by:[Table A](https://arxiv.org/html/2606.18273#A1.T1.6.6.4.1),[§1](https://arxiv.org/html/2606.18273#S1.p4.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1),[§3\.3\.2](https://arxiv.org/html/2606.18273#S3.SS3.SSS2.Px2.p1.2)\.
- \[36\]N\. Majumder, C\. Hung, D\. Ghosal, W\. Hsu, R\. Mihalcea, and S\. Poria\(2024\)Tango 2: aligning diffusion\-based text\-to\-audio generations through direct preference optimization\.External Links:2404\.09956,[Link](https://arxiv.org/abs/2404.09956)Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[37\]X\. Mei, C\. Meng, H\. Liu, Q\. Kong, T\. Ko, C\. Zhao, M\. D\. Plumbley, Y\. Zou, and W\. Wang\(2024\)Wavcaps: a chatgpt\-assisted weakly\-labelled audio captioning dataset for audio\-language multimodal research\.IEEE/ACM Transactions on Audio, Speech, and Language Processing32,pp\. 3339–3354\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[38\]J\. Melechovsky, Z\. Guo, D\. Ghosal, N\. Majumder, D\. Herremans, and S\. Poria\(2024\)Mustango: toward controllable text\-to\-music generation\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8286–8309\.Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[39\]P\. K\. O’Neill, V\. Lavrukhin, S\. Majumdar, V\. Noroozi, Y\. Zhang, O\. Kuchaiev, J\. Balam, Y\. Dovzhenko, K\. Freyberg, M\. D\. Shulman, B\. Ginsburg, S\. Watanabe, and G\. Kucsko\(2021\)SPGISpeech: 5,000 hours of transcribed financial audio for fully formatted end\-to\-end speech recognition\.External Links:2104\.02014,[Link](https://arxiv.org/abs/2104.02014)Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[40\]V\. Panayotov, G\. Chen, D\. Povey, and S\. Khudanpur\(2015\)Librispeech: an asr corpus based on public domain audio books\.In2015 IEEE international conference on acoustics, speech and signal processing \(ICASSP\),pp\. 5206–5210\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[41\]M\. Poli, M\. Luthra, Y\. Benchekroun, Y\. Higuchi, M\. Gleize, J\. Shen, R\. Algayres, Y\. Chung, M\. Assran, J\. Pino, and E\. Dupoux\(2025\)SpidR: learning fast and stable linguistic units for spoken language models without supervision\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856,[Link](https://openreview.net/forum?id=E7XAFBpfZs)Cited by:[Table A](https://arxiv.org/html/2606.18273#A1.T1.6.4.2.1),[§1](https://arxiv.org/html/2606.18273#S1.p4.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1),[§3\.3\.1](https://arxiv.org/html/2606.18273#S3.SS3.SSS1.Px2.p1.2)\.
- \[42\]S\. Poria, D\. Hazarika, N\. Majumder, G\. Naik, E\. Cambria, and R\. Mihalcea\(2019\)Meld: a multimodal multi\-party dataset for emotion recognition in conversations\.InProceedings of the 57th annual meeting of the association for computational linguistics,pp\. 527–536\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[43\]Y\. Qin, B\. Wei, J\. Ge, K\. Kallidromitis, S\. Fu, T\. Darrell, and X\. Wang\(2025\)Chain\-of\-visual\-thought: teaching vlms to see and think better with continuous visual tokens\.arXiv preprint arXiv:2511\.19418\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[44\]A\. Radford, J\. W\. Kim, T\. Xu, G\. Brockman, C\. Mcleavey, and I\. Sutskever\(2023\)Robust speech recognition via large\-scale weak supervision\.InProceedings of the 40th International Conference on Machine Learning,A\. Krause, E\. Brunskill, K\. Cho, B\. Engelhardt, S\. Sabato, and J\. Scarlett \(Eds\.\),Proceedings of Machine Learning Research, Vol\.202,pp\. 28492–28518\.External Links:[Link](https://proceedings.mlr.press/v202/radford23a.html)Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[45\]S\. Sakshi, U\. Tyagi, S\. Kumar, A\. Seth, R\. Selvakumar, O\. Nieto, R\. Duraiswami, S\. Ghosh, and D\. Manocha\(2024\)MMAU: a massive multi\-task audio understanding and reasoning benchmark\.External Links:2410\.19168,[Link](https://arxiv.org/abs/2410.19168)Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[46\]M\. Shih, H\. Chung, Y\. Pai, M\. Hsu, G\. Lin, S\. Li, and H\. Lee\(2023\)Gsqa: an end\-to\-end model for generative spoken question answering\.arXiv preprint arXiv:2312\.09781\.Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[47\]Y\. Shu, S\. Dong, G\. Chen, W\. Huang, R\. Zhang, D\. Shi, Q\. Xiang, and Y\. Shi\(2023\)Llasm: large language and speech model\.arXiv preprint arXiv:2308\.15930\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[48\]B\. L\. Sturm\(2013\)The gtzan dataset: its contents, its faults, their effects on evaluation, and its future use\.arXiv preprint arXiv:1306\.1461\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[49\]C\. Tang, W\. Yu, G\. Sun, X\. Chen, T\. Tan, W\. Li, L\. Lu, Z\. Ma, and C\. Zhang\(2023\)Salmonn: towards generic hearing abilities for large language models\.arXiv preprint arXiv:2310\.13289\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p1.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[50\]C\. Wang, M\. Rivière, A\. Lee, A\. Wu, C\. Talnikar, D\. Haziza, M\. Williamson, J\. Pino, and E\. Dupoux\(2021\)VoxPopuli: a large\-scale multilingual speech corpus for representation learning, semi\-supervised learning and interpretation\.External Links:2101\.00390,[Link](https://arxiv.org/abs/2101.00390)Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.
- \[51\]D\. Wang, J\. Wu, J\. Li, D\. Yang, X\. Chen, T\. Zhang, and H\. Meng\(2025\)MMSU: a massive multi\-task spoken language understanding and reasoning benchmark\.arXiv preprint arXiv:2506\.04779\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[52\]X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou\(2022\)Self\-consistency improves chain of thought reasoning in language models\.arXiv preprint arXiv:2203\.11171\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p2.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[53\]B\. Weck, I\. Manco, E\. Benetos, E\. Quinton, G\. Fazekas, and D\. Bogdanov\(2024\)MuChoMusic: evaluating music understanding in multimodal audio\-language models\.InProceedings of the 25th International Society for Music Information Retrieval Conference \(ISMIR\),Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[54\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p2.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[55\]J\. Xu, Z\. Guo, J\. He, H\. Hu, T\. He, S\. Bai, K\. Chen, J\. Wang, Y\. Fan, K\. Dang, B\. Zhang, X\. Wang, Y\. Chu, and J\. Lin\(2025\)Qwen2\.5\-omni technical report\.arXiv preprint arXiv:2503\.20215\.Cited by:[§1](https://arxiv.org/html/2606.18273#S1.p1.1),[§1](https://arxiv.org/html/2606.18273#S1.p5.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p1.1)\.
- \[56\]Q\. Yang, J\. Xu, W\. Liu, Y\. Chu, Z\. Jiang, X\. Zhou, Y\. Leng, Y\. Lv, Z\. Zhao, C\. Zhou,et al\.\(2024\)Air\-bench: benchmarking large audio\-language models via generative comprehension\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1979–1998\.Cited by:[Appendix A](https://arxiv.org/html/2606.18273#A1.SS0.SSS0.Px3.p1.1)\.
- \[57\]Y\. Li, R\. Yuan, G\. Zhang, Y\. Ma, X\. Chen, H\. Yin, C\. Xiao, C\. Lin, A\. Ragni, E\. Benetos,et al\.\(2024\)MERT: acoustic music understanding model with large\-scale self\-supervised training\.InThe Twelfth International Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[58\]N\. Zeghidour, A\. Luebs, A\. Omran, J\. Skoglund, and M\. Tagliasacchi\(2021\)Soundstream: an end\-to\-end neural audio codec\.IEEE/ACM Transactions on Audio, Speech, and Language Processing30,pp\. 495–507\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1)\.
- \[59\]E\. Zelikman, G\. Harik, Y\. Shao, V\. Jayasiri, N\. Haber, and N\. D\. Goodman\(2024\)Quiet\-star: language models can teach themselves to think before speaking\.arXiv preprint arXiv:2403\.09629\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px2.p1.1)\.
- \[60\]J\. Zhan, J\. Dai, J\. Ye, Y\. Zhou, D\. Zhang, Z\. Liu, X\. Zhang, R\. Yuan, G\. Zhang, L\. Li,et al\.\(2024\)Anygpt: unified multimodal llm with discrete sequence modeling\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 9637–9662\.Cited by:[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px1.p1.1)\.
- \[61\]K\. Zhang, B\. Li, P\. Zhang, F\. Pu, J\. A\. Cahyono, K\. Hu, S\. Liu, Y\. Zhang, J\. Yang, C\. Li, and Z\. Liu\(2024\)LMMs\-eval: reality check on the evaluation of large multimodal models\.arXiv preprint arXiv:2407\.12772\.Cited by:[Appendix B](https://arxiv.org/html/2606.18273#A2.p1.2)\.
- \[62\]X\. Zhang, L\. Li, X\. Lu, J\. Liu, and K\. A\. Lee\(2026\)Speaking clearly: a simplified whisper\-based codec for low\-bitrate speech coding\.InICASSP 2026\-2026 IEEE International Conference on Acoustics, Speech and Signal Processing \(ICASSP\),pp\. 17037–17041\.Cited by:[Table A](https://arxiv.org/html/2606.18273#A1.T1.6.3.1.1),[§2](https://arxiv.org/html/2606.18273#S2.SS0.SSS0.Px3.p1.1),[§4\.4](https://arxiv.org/html/2606.18273#S4.SS4.SSS0.Px3.p1.1)\.
- \[63\]Z\. Zhao, Y\. Jiang, H\. Liu, Y\. Wang, and Y\. Wang\(2024\)Librisqa: a novel dataset and framework for spoken question answering with large language models\.IEEE Transactions on Artificial Intelligence\.Cited by:[§4\.1](https://arxiv.org/html/2606.18273#S4.SS1.p2.1)\.

## Appendix AExperimental Details

##### Training configuration\.

All three backbones produce audio token embeddings at $r_s = 25$ Hz, with LM hidden size $d = 4096$ for Qwen2\-Audio and $d = 3584$ for Qwen2\.5\-Omni\-7B and Audio Flamingo 3\. Throughout the main CoAT training runs, the audio tower and all expert encoders are kept frozen\. Gradients update only the LoRA adapters and the per\-task projection heads\. We use LoRA adapters of rank $16$ with $\alpha = 32$\. Each projection head $P_k$ is a single\-block Transformer with multi\-head attention, a feed\-forward layer, and a linear map to the corresponding expert embedding dimension\.
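
For reference, under the standard LoRA parameterization of Hu et al\. \[25\], the adapted weight and the scaling implied by the stated rank and $\alpha$ are shown below. This assumes the original $\alpha / r$ scaling, since the appendix does not state which LoRA variant is used; $W_0$, $A$, $B$, $d_{\text{in}}$, and $d_{\text{out}}$ are notation introduced here for illustration.

```latex
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad \frac{\alpha}{r} = \frac{32}{16} = 2 .
```

Only $A$ and $B$ receive gradients while $W_0$ stays frozen, consistent with the statement that gradients update only the adapters and the projection heads.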

CoAT is trained with a two\-stage schedule\. The first stage is a reconstruction\-only warm\-up that aligns the thinking states with the audio\-token distribution\. The second stage activates all five expert losses together with the language\-modeling loss\. The loss weight for each expert compensates for differences in output scale, and the same optimizer, learning rate, LoRA configuration, and training schedule are used for every backbone\. Training runs on 4× NVIDIA B200 GPUs with an effective batch size of 16, taking approximately 88 B200 GPU\-hours per backbone for the full two\-stage schedule\. The full evaluation suite requires approximately 15 B200 GPU\-hours per evaluated model\. Table [A](https://arxiv.org/html/2606.18273#A1.T1) lists the expert encoders, and Table [B](https://arxiv.org/html/2606.18273#A1.T2) lists the training schedule and hyperparameters\.

Table A: Audio expert encoders used in CoAT\. $e_k$ is the expert embedding dimension and $r_k$ is the frame rate at which the expert emits features\.

Table B: Training schedule\. CoAT is trained in two stages that share the same optimizer, learning rate, and batch configuration\. Only the active expert set and the per\-expert loss weights change\. Stage 1 is a reconstruction\-only warm\-up, and stage 2 activates all five experts\. The basic\_pitch expert carries two coupled losses, an MSE on its dense intermediate feature and an auxiliary focal\-BCE on the 264\-bin pitch posteriorgram\.

Table C: Training corpus and single\-stage sampling budget \(1\.6M target samples\)\. Pool is the unique row count per source\. Subtotal aggregates the pool within each category\. Share is the per\-category mixing ratio; Samples is the resulting sample budget\.

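
The two coupled pitch losses named in the Table B caption can be sketched as below. This is an illustration under assumptions, not the paper's loss code: the focusing parameter `gamma`, the auxiliary weight `AUX_WEIGHT`, the active bins, and the toy feature vectors are all hypothetical. Only the 264\-bin posteriorgram size comes from the caption.

```python
import math


def focal_bce(probs, targets, gamma=2.0, eps=1e-7):
    """Mean focal binary cross-entropy over bins; gamma=0 recovers plain BCE."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)        # clamp to avoid log(0)
        p_t = p if t == 1 else 1.0 - p         # probability assigned to the true class
        total += -((1.0 - p_t) ** gamma) * math.log(p_t)
    return total / len(probs)


def mse(pred, target):
    return sum((a - b) ** 2 for a, b in zip(pred, target)) / len(pred)


NUM_BINS = 264                 # posteriorgram size stated in the Table B caption
targets = [0] * NUM_BINS
for active_bin in (60, 64, 67):  # hypothetical active pitch bins
    targets[active_bin] = 1
probs = [0.9 if t == 1 else 0.05 for t in targets]
probs[67] = 0.3                # one poorly predicted active bin

plain_bce = focal_bce(probs, targets, gamma=0.0)
focal = focal_bce(probs, targets, gamma=2.0)

feature_pred = [0.2, -0.1, 0.4]    # hypothetical dense intermediate feature
feature_target = [0.0, 0.0, 0.5]
AUX_WEIGHT = 0.5                   # hypothetical weight on the auxiliary term
pitch_loss = mse(feature_pred, feature_target) + AUX_WEIGHT * focal
print(f"BCE={plain_bce:.4f}  focal-BCE={focal:.4f}  pitch loss={pitch_loss:.4f}")
```

The focal factor down\-weights the many easy inactive bins, so the rare active bins dominate the auxiliary term. That is a common motivation for focal losses on sparse targets such as posteriorgrams.
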
##### Training datasets\.

Training data are drawn from publicly available sources organized into seven task groups: automatic speech recognition, audio and speech question answering, audio captioning, multiple\-choice audio understanding, music understanding, spoken\-instruction following, and a small text\-only supervised fine\-tuning split\. The text\-only split is included to preserve instruction\-following ability and to prevent catastrophic over\-refusal on harmful prompts; we observed that both abilities degraded rapidly under audio\-only multi\-task training\. After fixing the per\-task sampling ratio, we draw a training subset sized so that the full step schedule corresponds to one epoch over the subset\. To prevent overfitting on smaller tasks, we additionally cap the maximum number of epochs that any single task can repeat within this subset\.

Table [C](https://arxiv.org/html/2606.18273#A1.T3) lists the full training corpus together with the sampling ratios used to materialize a single stage of 1,600,000 shuffled rows\. Each category draws independently from its constituent sources at uniform probability, so per\-source budgets within a category scale with pool size\. The number of optimization steps is configured so that effective batch size multiplied by total steps matches the stage size, giving each row in the materialized file approximately one expected pass per stage and avoiding within\-stage oversampling\. The spoken\-instruction\-following category pools the evaluation\-aligned GSQA prompts, and the text\-only WildJailbreak split is included to anchor refusal behavior on harmful prompts while preventing over\-refusal on benign adversarial ones\.
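The budget arithmetic described above can be sketched as follows. This is a minimal illustration, not the authors' data pipeline: the stage size and effective batch size come from the text, while the function names, the example pool sizes, and the `max_epochs` cap value are hypothetical.

```python
STAGE_ROWS = 1_600_000   # rows materialized per stage (from the text)
EFFECTIVE_BATCH = 16     # effective batch size (from the text)


def steps_per_stage(stage_rows: int, effective_batch: int) -> int:
    """Steps such that effective_batch * steps covers the stage once."""
    return stage_rows // effective_batch


def per_source_budget(category_budget: int, pools: dict[str, int]) -> dict[str, int]:
    """Split a category budget across its sources in proportion to pool size."""
    total = sum(pools.values())
    return {name: round(category_budget * size / total) for name, size in pools.items()}


def capped_budget(budget: int, pool: int, max_epochs: float) -> int:
    """Limit how often a small pool may repeat (max_epochs is a hypothetical knob)."""
    return min(budget, int(max_epochs * pool))


if __name__ == "__main__":
    print(steps_per_stage(STAGE_ROWS, EFFECTIVE_BATCH))
    # Hypothetical pool sizes, for illustration only.
    print(per_source_budget(1000, {"src_a": 300, "src_b": 100}))
```

With the stated values, one stage corresponds to 1,600,000 / 16 = 100,000 optimization steps.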

##### Evaluation suite\.

We evaluate CoAT on a broad benchmark suite organized into five families that mirror the rows of Table [1](https://arxiv.org/html/2606.18273#S4.T1)\. General audio understanding and reasoning uses MMAU \[[45](https://arxiv.org/html/2606.18273#bib.bib43)\], MMAR \[[34](https://arxiv.org/html/2606.18273#bib.bib48)\], MMSU \[[51](https://arxiv.org/html/2606.18273#bib.bib49)\], ClothoAQA \[[32](https://arxiv.org/html/2606.18273#bib.bib59)\], Alpaca\-Audio \[[36](https://arxiv.org/html/2606.18273#bib.bib45)\], and WavCaps \[[37](https://arxiv.org/html/2606.18273#bib.bib57)\]\. AIR\-Bench Foundation \[[56](https://arxiv.org/html/2606.18273#bib.bib64)\] contributes its three subsets covering speech, sound, and music\. Music classification spans VocalSound \[[21](https://arxiv.org/html/2606.18273#bib.bib51)\], GTZAN \[[48](https://arxiv.org/html/2606.18273#bib.bib53)\], and MuchoMusic \[[53](https://arxiv.org/html/2606.18273#bib.bib50)\]\. Speech emotion recognition is measured on MELD \[[42](https://arxiv.org/html/2606.18273#bib.bib54)\] and IEMOCAP \[[4](https://arxiv.org/html/2606.18273#bib.bib55)\]\. Speech transcription uses LibriSpeech \[[40](https://arxiv.org/html/2606.18273#bib.bib40)\], Common Voice 15 \[[1](https://arxiv.org/html/2606.18273#bib.bib47)\], GigaSpeech \[[5](https://arxiv.org/html/2606.18273#bib.bib41)\], VoxPopuli \[[50](https://arxiv.org/html/2606.18273#bib.bib46)\], SPGISpeech \[[39](https://arxiv.org/html/2606.18273#bib.bib44)\], and Switchboard \[[18](https://arxiv.org/html/2606.18273#bib.bib42)\]\. For MELD we report both accuracy and class\-support\-weighted F1, following the convention of the MELD paper for class\-imbalanced 7\-way emotion classification\.

## Appendix B Evaluation Protocol Details

All inference uses vLLM as the backend with greedy decoding \($T=0$, top\-$p=0.95$\)\. For tasks supplied by upstream lmms\-eval \[[61](https://arxiv.org/html/2606.18273#bib.bib65)\] we adopt the released implementation as\-is\. For benchmarks that are either not natively supported by upstream or whose upstream implementation we found to mishandle the audio, the prompt template, or the metric, we use our own task definitions\. These custom definitions cover IEMOCAP, MELD, SPGISpeech, Switchboard, GTZAN, and VocalSound\.

MELD’s seven emotion classes are highly imbalanced, with neutral alone accounting for roughly half of the test set, so accuracy alone can be inflated by a model that always predicts the majority class\. We therefore report a weighted F1 score, defined as the average of per\-class F1 with each class weighted by its support \(i\.e\., the number of test samples in that class\)\.
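The support-weighted F1 defined above can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the stated definition, not the evaluation code used in the paper; the label names and the toy data are made up to show how a majority-class predictor scores well on accuracy but poorly on weighted F1.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Average of per-class F1, each class weighted by its support in y_true."""
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = n - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        score += (n / total) * f1
    return score


if __name__ == "__main__":
    # Toy data: half the samples are "neutral"; the model always predicts it.
    y_true = ["neutral"] * 5 + ["joy"] * 3 + ["anger"] * 2
    y_pred = ["neutral"] * 10
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    print(accuracy, weighted_f1(y_true, y_pred))
```

In this toy case the always-neutral predictor reaches 0.5 accuracy, while its weighted F1 is only 1/3, because the two minority classes contribute zero.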

Table D: Per\-benchmark evaluation protocol\. MM denotes max\_new\_tokens as configured in lmms\-eval\. LLM judges use the official prompt of each benchmark, reproduced verbatim\.

| Family | Benchmark | Split | MM | Metric / Judge |
| --- | --- | --- | --- | --- |
| General Audio | MMAU | test\_mini | 128 | Accuracy |
| | MMAR | test | 128 | Accuracy |
| | MMSU | test | 256 | Accuracy |
| | ClothoAQA | test | 8 | Exact match |
| | Alpaca\-Audio | test | 1024 | GPT\-4o |
| | WavCaps | test | 1024 | GPT\-4o |
| AIR\-Bench Foundation | ABF Speech | foundation | 256 | Accuracy |
| | ABF Sound | foundation | 256 | Accuracy |
| | ABF Music | foundation | 256 | Accuracy |
| Music classification | VocalSound | test | 20 | Accuracy |
| | GTZAN | test | 20 | Accuracy |
| | MuchoMusic | test | 256 | Accuracy |
| Speech emotion recognition | MELD | test | 20 | Accuracy / F1 |
| | IEMOCAP | Session 5 \(LOSO\) | 20 | Accuracy |
| Automatic Speech Recognition | LibriSpeech\-clean / \-other | test | 256 | WER ↓ |
| | Common Voice 15 | test \(en\) | 256 | WER ↓ |
| | GigaSpeech | test | 256 | WER ↓ |
| | VoxPopuli | test \(en\) | 2048 | WER ↓ |
| | SPGISpeech | test | 512 | WER ↓ |
| | Switchboard | eval2000 | 512 | WER ↓ |

Audio clips longer than 120 seconds are skipped at data\-loading\. This affects only a negligible fraction of inputs and is applied identically to every model, so it does not affect relative comparisons\. GigaSpeech references additionally contain meta\-tags \(<MUSIC\>, <NOISE\>, <SIL\>, <OTHER\>\) that are stripped before scoring\. Samples whose reference becomes empty after this stripping are excluded from the WER aggregate, since an empty reference is unscorable and would otherwise count every emitted word as a pure insertion error\. No analogous empty\-reference filter is applied to the other ASR benchmarks, which do not use such tags\.
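The GigaSpeech reference handling described above can be sketched as follows. This is a minimal illustration under stated assumptions: the four tag names come from the text, but the function names are hypothetical, no text normalization is applied, and the aggregate is assumed to be corpus-level WER (total word errors divided by total reference words), which the text does not specify.

```python
import re

# The four meta-tags named in the text.
META_TAG = re.compile(r"<(?:MUSIC|NOISE|SIL|OTHER)>")


def strip_meta_tags(text: str) -> str:
    """Remove meta-tags and collapse the remaining whitespace."""
    return " ".join(META_TAG.sub(" ", text).split())


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def corpus_wer(pairs):
    """WER over (reference, hypothesis) pairs, skipping empty references."""
    errors = words = 0
    for ref, hyp in pairs:
        ref_words = strip_meta_tags(ref).split()
        if not ref_words:  # reference is empty after stripping: unscorable
            continue
        errors += edit_distance(ref_words, hyp.split())
        words += len(ref_words)
    return errors / words if words else 0.0


if __name__ == "__main__":
    pairs = [("<MUSIC>", "some words"), ("A B C D", "A X C D")]
    print(corpus_wer(pairs))
```

Without the empty-reference filter, the first pair in the usage example would add two insertion errors against zero reference words and inflate the aggregate.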

Figure [A](https://arxiv.org/html/2606.18273#A2.F1) shows the LLM judge configuration and prompt used for WavCaps and Alpaca\-Audio, with model and decoding hyperparameters listed above the prompt box\. Inside the box, System\. and User\. mark the chat roles, \[Section\] markers are part of the literal prompt, and italic blue text denotes per\-sample variables\. We normalized the 5\-point scale scores to a 100\-point scale\.

Alpaca\-Audio & WavCaps \(single response, 0–100\)

Judge model: gpt\-4o\-2024\-11\-20 \| Decoding: temperature 0, max tokens 1024 \| System message: none \| Variables: \{question\}, \{ground\_truth\}, \{model\_response\}\.

LLM Judge prompt for Alpaca\-Audio & WavCaps

User\.

\[Question\] \{question\}

\[Reference Answer\] \{ground\_truth\}

\[Model Answer\] \{model\_response\}

\[Task\] Rate the model’s answer based on its alignment with the reference answer, focusing on accuracy and relevance to the reference provided\. Please be critical on the details\.

Criteria: Assess if the model’s response mirrors the reference in terms of content, accuracy, and relevance\.

Score 0: The answer is completely misaligned, providing incorrect or irrelevant information compared to the reference\.

Score 1: The answer shows minimal alignment, often misunderstanding or providing irrelevant details unrelated to the reference\.

Score 2: The answer recognizes the topic but diverges significantly from the reference in accuracy or relevance\.

Score 3: The answer aligns with the reference generally but lacks detail or precise accuracy in some aspects\.

Score 4: The answer is mostly accurate and relevant, closely following the reference but could be clearer or more detailed\.

Score 5: The answer is highly accurate, detailed, and matches the reference answer perfectly, capturing its essence and detail\.

Your response should be formatted as follows:

Explanation: \(Provide a concise explanation of your rating, comparing the reference answer with the model’s response\. “The reference answer is \[XXX\], while the model’s answer is \[YYY\]\. I think …”\)

Rating: \(int\)

Figure A: LLM judge configuration and prompt for Alpaca\-Audio & WavCaps\.
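A minimal sketch of how a judge reply in the format above could be turned into a 100-point score. The exact normalization is not spelled out in the text, so the linear mapping (rating × 20) is an assumption, as are the function names and the choice to drop replies with no parseable rating.

```python
import re

# Matches "Rating: 4" as well as "Rating: (4)"; ratings are integers 0 to 5.
RATING = re.compile(r"Rating:\s*\(?\s*([0-5])\b")


def parse_rating(judge_output: str):
    """Return the integer rating from a judge reply, or None if absent."""
    match = RATING.search(judge_output)
    return int(match.group(1)) if match else None


def to_100(rating: int) -> float:
    """Assumed linear rescaling of a 0-5 rating to a 0-100 score."""
    return rating * 100 / 5


def mean_score(judge_outputs):
    """Mean 0-100 score over replies with a parseable rating (assumption)."""
    ratings = [r for r in map(parse_rating, judge_outputs) if r is not None]
    if not ratings:
        return None
    return sum(to_100(r) for r in ratings) / len(ratings)


if __name__ == "__main__":
    replies = ["Explanation: close match.\nRating: 5", "Rating: (3)", "no rating"]
    print(mean_score(replies))
```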
## Appendix C Inference Cost Analysis

Table E: Inference cost by audio duration\. Per\-duration mean ± std across 3 random seeds, with 300 samples per range per seed\. TTFT, decode, and total are reported in seconds\. Forced prefix is the length in tokens of CoAT’s prepended thinking block\. “–” marks variants that do not prepend a thinking block\. Hardware and serving configuration follow Appendix [C](https://arxiv.org/html/2606.18273#A3)\.

##### Setup\.

Two complementary measurements support Table [2](https://arxiv.org/html/2606.18273#S4.T2)\. For the main table, we run all variants on the full MMAU and MMAR evaluation sets \(2,000 samples total\) and report per\-sample mean latency\. For the duration\-stratified analysis, we draw a fixed 300 samples per audio\-duration range from the broader evaluation suite, which is otherwise heavily skewed toward short audio \(over 90% of samples are below 15 s\)\. This gives 1,500 samples spanning 0 to 120 s, and we repeat the sweep with three independent random seeds\. Inference is run with batch size 1 on 4× NVIDIA B200 \(tensor parallel size 4\) under vLLM 0\.19, with prefix caching disabled so that every request pays its full prefill cost\. Note that the AF3\-think model is evaluated with GPT\-eval because of an answer\-parsing problem\.
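The duration-stratified draw can be sketched as below. The text gives the per-range sample count (300), the overall span (0 to 120 s), and the use of independent seeds, but not the range boundaries; the bin edges here are hypothetical placeholders, as are the function names.

```python
import random
from bisect import bisect_right

# Hypothetical bin edges in seconds; the actual ranges are not given in the text.
EDGES = [0, 15, 30, 60, 90, 120]


def bin_index(duration, edges=EDGES):
    """Index of the range containing duration, or None if out of span."""
    if duration < edges[0] or duration > edges[-1]:
        return None
    return min(bisect_right(edges, duration) - 1, len(edges) - 2)


def stratified_sample(durations, per_bin=300, seed=0, edges=EDGES):
    """Draw up to per_bin sample indices from every duration range."""
    rng = random.Random(seed)
    bins = [[] for _ in range(len(edges) - 1)]
    for idx, duration in enumerate(durations):
        b = bin_index(duration, edges)
        if b is not None:
            bins[b].append(idx)
    picked = []
    for members in bins:
        picked.extend(rng.sample(members, min(per_bin, len(members))))
    return picked


if __name__ == "__main__":
    # Synthetic durations covering 0.0 to 119.9 s, for illustration only.
    durations = [(i % 1200) / 10 for i in range(6000)]
    for seed in (0, 1, 2):
        print(seed, len(stratified_sample(durations, seed=seed)))
```

Fixing the count per range rather than sampling uniformly keeps long clips represented even though the underlying pool is dominated by short audio.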

##### Per\-duration results\.

Table [E](https://arxiv.org/html/2606.18273#A3.T5) reports time to first token, decode time, total wall\-clock, decoded tokens, and CoAT’s forced\-prefix length for each audio\-duration range as mean ± std across 3 random seeds\. Two patterns stand out\. First, CoAT’s forced\-prefix length grows monotonically with audio duration and spans more than an order of magnitude across the five ranges\. Despite this growth, CoAT’s time to first token stays within a few milliseconds of the same\-backbone baseline in every range, indicating that the additional prefill compute is largely absorbed by the audio encoder’s existing cost\. Second, text\-CoT’s decoded\-token count stays roughly flat across all five ranges, reflecting that the textual chain\-of\-thought is determined by task format rather than audio duration\. Its decode wall\-clock is correspondingly stable across ranges and contributes the dominant cost regardless of input length\.
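One plausible reading of "mean ± std across 3 random seeds" is sketched below: average the metric within each seed, then report the mean and standard deviation of those per-seed averages. Both the aggregation order and the use of the sample standard deviation are assumptions, and the latency values in the usage example are invented.

```python
from statistics import mean, stdev


def summarize_across_seeds(per_seed_values):
    """Return (mean, std) of the per-seed means for one duration range.

    per_seed_values holds one list of measurements per seed.
    Uses the sample standard deviation (an assumption).
    """
    seed_means = [mean(values) for values in per_seed_values]
    return mean(seed_means), stdev(seed_means)


if __name__ == "__main__":
    # Invented TTFT measurements in seconds for three seeds.
    ttft = [[0.10, 0.12], [0.11, 0.13], [0.12, 0.14]]
    m, s = summarize_across_seeds(ttft)
    print(f"{m:.3f} ± {s:.3f}")
```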

## Appendix D Inference\-Time Think Token vs\. CoAT

Table F: AF3 think\-token comparison\. Per\-benchmark performance of Audio Flamingo 3 with the inference\-time \+ think token policy and our \+ CoAT fine\-tune\. Since the AF3 think model does not support adaptive thinking mode, we cannot evaluate ASR with WER and therefore omit this value\.

| Benchmark | Eval | Audio Flamingo 3 | \+ think | \+ CoAT | Δ |
| --- | --- | --- | --- | --- | --- |
| **Audio Understanding & Reasoning** | | | | | |
| **General** | | | | | |
| MMAU | Acc ↑ | 69\.40 | 64\.52 | 70\.00 | \+0\.60 / \+5\.48 |
| MMAR | Acc ↑ | 55\.70 | 54\.25 | 59\.60 | \+3\.90 / \+5\.35 |
| MMSU | Acc ↑ | 60\.01 | 57\.03 | 58\.36 | −1\.65 / \+1\.33 |
| ClothoAQA | Acc ↑ | 80\.10 | 68\.79 | 85\.30 | \+5\.20 / \+16\.51 |
| Alpaca\-Audio | GPT ↑ | 38\.80 | 30\.10 | 58\.59 | \+19\.79 / \+28\.49 |
| WavCaps | GPT ↑ | 33\.40 | 41\.00 | 29\.00 | −4\.40 / −12\.00 |
| **AIR\-Bench Foundation** | | | | | |
| ABF Speech | Acc ↑ | 62\.99 | 58\.04 | 71\.24 | \+8\.25 / \+13\.20 |
| ABF Sound | Acc ↑ | 65\.56 | 63\.39 | 69\.50 | \+3\.94 / \+6\.11 |
| ABF Music | Acc ↑ | 58\.62 | 54\.40 | 64\.50 | \+5\.88 / \+10\.10 |
| **Music Classification** | | | | | |
| VocalSound | Acc ↑ | 93\.06 | 82\.22 | 92\.39 | −0\.67 / \+10\.17 |
| GTZAN | Acc ↑ | 94\.99 | 92\.89 | 95\.50 | \+0\.51 / \+2\.61 |
| MuchoMusic | Acc ↑ | 81\.63 | 75\.76 | 81\.80 | \+0\.17 / \+6\.04 |
| **Speech Emotion Recognition** | | | | | |
| MELD | Acc ↑ | 40\.75 | 40\.46 | 59\.83 | \+19\.08 / \+19\.47 |
| MELD | F1 ↑ | 45\.93 | 44\.96 | 57\.16 | \+11\.23 / \+12\.20 |
| IEMOCAP | Acc ↑ | 63\.58 | 56\.98 | 70\.39 | \+6\.81 / \+13\.41 |
| **Automatic Speech Recognition** | | | | | |
| LibriSpeech\-clean | WER ↓ | 1\.57 | \- | 1\.99 | \+0\.42 / \- |
| LibriSpeech\-other | WER ↓ | 3\.13 | \- | 4\.23 | \+1\.10 / \- |
| Common Voice 15 | WER ↓ | 7\.40 | \- | 7\.40 | \+0\.00 / \- |
| GigaSpeech | WER ↓ | 10\.27 | \- | 11\.90 | \+1\.63 / \- |
| VoxPopuli | WER ↓ | 5\.55 | \- | 5\.70 | \+0\.15 / \- |
| SPGISpeech | WER ↓ | 1\.86 | \- | 1\.84 | −0\.02 / \- |
| Switchboard | WER ↓ | 8\.01 | \- | 7\.18 | −0\.83 / \- |

A natural question is whether the gains we attribute to our reconstruction teachers can in fact be obtained for free, simply by allocating extra inference\-time compute to a think segment\. On the Audio Flamingo 3 backbone we isolate this question by comparing two settings: an inference\-time think policy that prepends a think segment to the frozen pretrained model, and our CoAT fine\-tune that supervises the same segment with multi\-teacher reconstruction targets\. Table [F](https://arxiv.org/html/2606.18273#A4.T6) reports per\-benchmark scores together with the gap Δ between CoAT and each of the other two settings \(CoAT minus vanilla AF3 / CoAT minus \+ think\)\.

“Audio Flamingo 3” denotes the vanilla AF3 model, whereas “\+ think” \(or “with CoT”\) refers to AF3 with additional thinking\. The two policies behave very differently\. Inference\-time thinking by itself is largely neutral or harmful, regressing on a clear majority of benchmarks across audio understanding, music, and emotion families, with the largest drops concentrated on captioning and open\-ended QA\. CoAT, in contrast, delivers consistent gains on the metrics that exercise audio understanding \(most prominently voice\-assistant style QA, AIR\-Bench Foundation, and speech emotion recognition\) at the cost of a modest WER regression on most ASR test sets, consistent with mild distribution shift from fine\-tuning on a multi\-task corpus\. The takeaway is that the benefit of CoAT comes from what the think segment is supervised to reconstruct, not from the segment’s mere presence at inference\.
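As a consistency check, the Δ pairs and the direction of each change can be recomputed from the non-ASR scores reported in Table F. The sketch below only restates those reported numbers; the variable and function names are illustrative.

```python
# (vanilla AF3, + think, + CoAT) as reported in Table F; higher is better.
SCORES = {
    "MMAU": (69.40, 64.52, 70.00),
    "MMAR": (55.70, 54.25, 59.60),
    "MMSU": (60.01, 57.03, 58.36),
    "ClothoAQA": (80.10, 68.79, 85.30),
    "Alpaca-Audio": (38.80, 30.10, 58.59),
    "WavCaps": (33.40, 41.00, 29.00),
    "ABF Speech": (62.99, 58.04, 71.24),
    "ABF Sound": (65.56, 63.39, 69.50),
    "ABF Music": (58.62, 54.40, 64.50),
    "VocalSound": (93.06, 82.22, 92.39),
    "GTZAN": (94.99, 92.89, 95.50),
    "MuchoMusic": (81.63, 75.76, 81.80),
    "MELD (Acc)": (40.75, 40.46, 59.83),
    "MELD (F1)": (45.93, 44.96, 57.16),
    "IEMOCAP": (63.58, 56.98, 70.39),
}


def deltas(base, think, coat):
    """Return (CoAT - vanilla, CoAT - think), rounded to two decimals."""
    return round(coat - base, 2), round(coat - think, 2)


# Benchmarks where inference-time thinking scores below the vanilla model.
think_regressions = [n for n, (b, t, _) in SCORES.items() if t < b]
# Benchmarks where the CoAT fine-tune scores above the vanilla model.
coat_gains = [n for n, (b, _, c) in SCORES.items() if c > b]

if __name__ == "__main__":
    for name, row in SCORES.items():
        print(name, deltas(*row))
    print(len(think_regressions), len(coat_gains), len(SCORES))
```

On these 15 rows, the + think policy falls below vanilla AF3 on 14 (WavCaps is the exception), while CoAT improves over vanilla AF3 on 12.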

## Appendix E Societal Impact

CoAT improves audio understanding in audio language models, with potential positive applications such as accessibility \(captioning, transcription, assistive listening\) and content moderation\. As with audio language models in general, the same improvements may have dual\-use implications that warrant consideration at deployment time\. These considerations apply at the level of the underlying backbone rather than being introduced by CoAT itself\.
