Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

arXiv cs.CL 06/04/26, 04:00 AM Papers

speech-llm chain-of-thought entity-binding reasoning multimodal speech-understanding

Summary

This paper identifies a localized "entity binding failure" in Speech Large Language Models (SLLMs): on logical tasks that require entity tracking, accuracy with speech input collapses to chance level. It proposes Entity-Aware Chain-of-Thought (EA-CoT) prompting to address this and reports an absolute accuracy improvement of up to 24.4%.

arXiv:2606.04474v1. Announce type: new.

Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.

Cached at: 06/05/26, 02:14 AM

# Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Source: [https://arxiv.org/abs/2606.04474](https://arxiv.org/abs/2606.04474)
[View PDF](https://arxiv.org/pdf/2606.04474)

> Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.
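
The abstract describes EA-CoT only at a high level: enumerate the entities, bind each one to its claims, then reason. As a rough illustration of that ordering, the Python sketch below assembles an entity-first instruction prompt. The function name `build_ea_cot_prompt`, the step wording, and the `Answer:` convention are all hypothetical; the paper's actual prompt text is not given here, and how the prompt is paired with audio input depends on the model being used.

```python
def build_ea_cot_prompt(question: str) -> str:
    """Assemble an entity-first reasoning prompt (hypothetical wording).

    The step order mirrors the abstract's description: enumerate entities,
    bind them to claims, and only then reason. The exact phrasing is an
    assumption, not the authors' prompt.
    """
    steps = [
        "List every entity (person, object, or place) mentioned in the audio.",
        "For each entity, write down each claim or property attached to it.",
        "Using only the bindings you wrote down, reason step by step.",
        "State the final answer on its own line, prefixed with 'Answer:'.",
    ]
    # Number the steps so the model is nudged to follow them in order.
    numbered = "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
    return f"{question}\n\nBefore answering, follow these steps in order:\n{numbered}"


if __name__ == "__main__":
    print(build_ea_cot_prompt("Based on the spoken statements, who is telling the truth?"))
```

Writing the bindings out before any inference step is the point of the sketch: it moves the entity-property associations into explicit text instead of leaving them to implicit reasoning over speech features.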

## Submission history

From: Ming-Hao Hsu [[view email](https://arxiv.org/show-email/da6e2e4b/2606.04474)]
**[v1]** Wed, 3 Jun 2026 05:44:09 UTC (73 KB)

Similar Articles

Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations

Where Instruction Hierarchy Breaks: Diagnosing and Repairing Failures in Reasoning Language Models

Cell-Based Representation of Relational Binding in Language Models

Constraint-Anchored Reasoning Traces

Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
