Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention
Summary
This paper identifies a localized 'entity binding failure' in Speech Large Language Models (SLLMs): on logical reasoning tasks that require entity tracking, speech-to-text accuracy collapses to chance level. It proposes Entity-Aware Chain-of-Thought (EA-CoT) prompting, which has the model explicitly enumerate entities and bind them to claims before reasoning, and reports up to a 24.4% absolute accuracy improvement.
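To make the EA-CoT idea concrete, here is a minimal sketch of how such an instruction might be assembled. The abstract only says that the model is made to enumerate entities and bind them to claims before reasoning; the exact prompt wording is not given here, so the step text, the `EA_COT_STEPS` constant, and the `build_ea_cot_instruction` helper below are all hypothetical illustrations rather than the authors' prompt.

```python
from typing import Optional

# Hypothetical step wording: the source does not give the actual EA-CoT prompt.
EA_COT_STEPS = [
    "List every entity mentioned in the input.",
    "For each entity, write down the claims or properties attached to it.",
    "Reason step by step using only the bindings you listed.",
    "Give the final answer on its own line, prefixed with 'Answer:'.",
]


def build_ea_cot_instruction(question: Optional[str] = None) -> str:
    """Build an entity-aware chain-of-thought instruction (illustrative only).

    For spoken input the question would arrive as audio, so `question` is
    optional; pass a string only when the question is also available as text.
    """
    numbered = "\n".join(
        f"{i}. {step}" for i, step in enumerate(EA_COT_STEPS, start=1)
    )
    instruction = "Before answering, follow these steps in order:\n" + numbered
    if question is not None:
        instruction += f"\n\nQuestion: {question}"
    return instruction


if __name__ == "__main__":
    print(build_ea_cot_instruction("Who owns the red book?"))
```

The ordering is the point of the sketch: enumeration and binding are written out before any reasoning step, which is the explicit semantic binding the abstract credits for the gains.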
# Entity Binding Failures in Speech LLM Reasoning: Diagnosis and Chain-of-Thought Intervention

Source: [https://arxiv.org/abs/2606.04474](https://arxiv.org/abs/2606.04474)

[View PDF](https://arxiv.org/pdf/2606.04474)

> Abstract: Speech Large Language Models (SLLMs) underperform their text counterparts on complex reasoning. We reveal that this modality gap is not a uniform cognitive deficit. Evaluating three diverse SLLMs, we show speech-to-text (S2T) matches or exceeds text-to-text (T2T) on spatial, syntactic, and factual tasks. However, on logical tasks requiring entity tracking, S2T accuracy collapses to chance. We diagnose this localized degradation as an entity binding failure: continuous speech features cause models to lose precise entity-property associations during implicit reasoning. To resolve this, we propose Entity-Aware Chain-of-Thought (EA-CoT), forcing SLLMs to explicitly enumerate entities and bind them to claims before reasoning. Strikingly, EA-CoT bridges the gap, even when spoken names are misrecognized, yielding up to a 24.4% absolute accuracy improvement. Ablations confirm these gains stem entirely from explicit semantic binding, reframing the gap as a resolvable bottleneck.

## Submission history

From: Ming-Hao Hsu [[view email](https://arxiv.org/show-email/da6e2e4b/2606.04474)]

**[v1]** Wed, 3 Jun 2026 05:44:09 UTC (73 KB)
Similar Articles
Fragile Thoughts: How Large Language Models Handle Chain-of-Thought Perturbations
This paper presents a comprehensive empirical evaluation of how large language models handle corruptions in chain-of-thought reasoning steps, testing 13 models across 5 perturbation types (MathError, UnitConversion, Sycophancy, SkippedSteps, ExtraSteps) on mathematical reasoning tasks. The findings reveal heterogeneous vulnerability patterns with implications for deploying LLMs in multi-stage reasoning pipelines.
Cell-Based Representation of Relational Binding in Language Models
This study reveals that LLMs encode discourse-level relational binding through a Cell-based Binding Representation (CBR), a low-dimensional linear subspace in which each cell maps to an entity-relation pair, and provides causal evidence for how models track entities and relations.
Agentic Chain-of-Thought Steering for Efficient and Controllable LLM Reasoning
ACTS (Agentic Chain-of-Thought Steering) formulates LLM reasoning control as a Markov decision process where a controller agent adaptively steers a frozen reasoner during inference using reasoning strategies and steering phrases. The approach achieves comparable accuracy to full-thinking models with significant token savings, enabling controllable accuracy-efficiency trade-offs.
Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks
This paper identifies a blind spot in long-context LLM reasoning benchmarks: they fail to control task position within the context, allowing positional failures to go undetected. The authors propose Context Rot Evaluation (CRE) to systematically vary task position, filler content, and context length, revealing severe accuracy drops for some models when reasoning tasks are placed in the middle of long contexts.
Reasoning Can Be Restored by Correcting a Few Decision Tokens
This paper shows that the reasoning gap between base LLMs and large reasoning models is concentrated on a small set of early planning tokens. It introduces disagreement-guided token intervention, where replacing only those critical tokens with a reasoning model's outputs allows a base model to nearly match the reasoning model's performance.