# LACE: Lattice Attention for Cross-thread Exploration
Source: https://arxiv.org/html/2604.15529
## Abstract
Current large language models reason in isolation. Although it is common to sample multiple reasoning paths in parallel, these trajectories do not interact, and often fail in the same redundant ways. We introduce LACE, a framework that transforms reasoning from a collection of independent trials into a coordinated, parallel process. By repurposing the model architecture to enable cross-thread attention, LACE allows concurrent reasoning paths to share intermediate insights and correct one another during inference. A central challenge is the absence of natural training data that exhibits such collaborative behavior. We address this gap with a synthetic data pipeline that explicitly teaches models to communicate and error-correct across threads. Experiments show that this unified exploration substantially outperforms standard parallel search, improving reasoning accuracy by over 7 points. Our results suggest that large language models can be more effective when parallel reasoning paths are allowed to interact.
Figure 1: Comparison between standard isolated sampling and our LACE framework. Blue lines represent common causal attention, while purple lines denote our lattice attention. Top: Traditional LLMs perform reasoning in isolation; these independent threads lack communication, often leading them to fail in identical ways. Bottom: LACE introduces Lattice Attention, which enables cross-thread interaction. This architecture allows concurrent paths to share intermediate insights and error-correct on the fly, eliciting synergistic reasoning that ensures diverse exploration and a higher likelihood of discovering the optimal solution.
## 1 Introduction
Reasoning is rarely a straight line. When humans tackle a difficult proof or a complex plan, we don't commit to a single path in a vacuum (Kahneman, 2011). We explore multiple hypotheses at once, allowing a failure in one thread to pivot the search in another, a behavior observed both cognitively (Johnson-Laird, 2010) and neurophysiologically (Kay et al., 2020). We navigate uncertainty through this constant internal dialogue, a process of collateral thinking.
In contrast, Large Language Models are confined to strictly sequential generation (Vaswani et al., 2017). The standard workaround is to sample k distinct solutions in parallel and select the best one (Cobbe et al., 2021; Zheng et al., 2025; Wu et al., 2025), but each thread remains independent of the others. This approach is inherently inefficient because the samples are isolated. Like k agents solving a puzzle in separate rooms, the threads cannot share breakthroughs or warn against dead ends, as shown in Figure 1. Consequently, computation is wasted on redundancy rather than synergy; threads frequently fall into correlated errors (Kim et al., 2025), because isolation hurts diversity and reduces the likelihood of discovering optimal answers. Moreover, post-hoc verification of isolated samples is not only inefficient but also prone to internal biases, as the model may consistently favor certain flawed reasoning patterns across all independent threads (Madaan et al., 2023).
This motivates our central question: **Can reasoning threads communicate during generation to share in-situ insights and solve problems collaboratively?**
In this paper, we introduce **LACE** (Lattice Attention for Cross-thread Exploration), a framework that reconfigures the transformer to support collateral thinking. We generalize the standard 1-D causal attention into a 2-D Lattice Attention structure, introducing a width dimension that allows information to flow not just across time (tokens), but across threads. This architecture transforms reasoning from a set of isolated, independent events into a unified, collaborative exploration within a single forward pass. By bridging these threads, LACE actively diversifies search strategies to prevent redundant failures (i.e., stumbling over the same stone) and enables in-situ evaluation to identify and select the optimal solution on the fly. As shown in Figure 1, this is a capability isolated samplers cannot achieve by design.
Realizing this capability requires overcoming a significant data-scarcity challenge. Standard pre-training corpora lack multi-threaded reasoning. Unlike conventional single-thread data, LACE requires parallel threads that are correlated yet logically diverse, a regime largely unexplored by the community. Datasets that produce multiple solutions via high-temperature sampling (Toshniwal et al., 2024b), few-shot prompting (Toshniwal et al., 2024a; Zheng et al., 2025), or multi-model sampling (Muennighoff et al., 2025) yield superficial rephrasings rather than logically diverse reasoning paths. Moreover, they are self-contained, mutually independent traces. Meanwhile, most parallelized datasets (Wu et al., 2025) merely interleave short parallel segments within an otherwise sequential trace, severely limiting the scope for sustained cross-thread collaboration.
Driven by these observations, we present a synthetic data curation pipeline that constructs parallel traces with explicit interaction points. In contrast to naïve high-temperature sampling, which yields superficial rephrasings, our pipeline generates threads with inherently distinct solution paths. We leverage this data for continuous pre-training, followed by Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). Specifically, we introduce random jittering during SFT to compel cross-thread information sharing, and design specialized RL rewards that incentivize threads to explore diversely and self-evaluate their solutions to maximize accuracy.
Visualizations and numerical results show that our method emergently learns to share insights across threads and select the best solution on the fly, outperforming independent sampling baselines on challenging reasoning benchmarks such as AIME 25 (Zhang & Math-AI, 2025), AIME 24 (Zhang & Math-AI, 2024), and LiveBench (White et al., 2025). Notably, these capabilities emerge even though the model is trained solely on our synthetic data, demonstrating robust generalization to real-world tasks.
Our contributions are summarized as follows:
- We propose **Lattice Attention**, a principled generalization of 1-D causal attention to 2-D that enables multiple reasoning threads to communicate on-the-fly during generation, transforming isolated parallel sampling into unified collaborative exploration.
- We design a **multi-thread post-training framework** with a dedicated **synthetic data pipeline** that generates correlated yet logically diverse reasoning threads, addressing the critical data scarcity challenge and incentivizing collaborative exploration.
- We demonstrate that LACE significantly outperforms independent sampling baselines on challenging benchmarks, with emergent cross-thread communication and on-the-fly solution selection capabilities that generalize beyond synthetic training data.
## 2 Related Work
#### External Test-Time Search and Scaling
The paradigm of large language models is shifting from scaling parameters to scaling inference-time compute (Snell et al., 2024; Wu et al., 2024), with recent studies showing that additional reasoning-time computation, often framed as System 2 thinking, can yield gains comparable to scaling model size (Li et al., 2025). Existing methods largely realize this extra compute through external search or orchestration, including sequential reasoning strategies such as Chain-of-Thought and its variants (Wei et al., 2022b; Kojima et al., 2022; Zhou et al., 2022), structured search methods such as Tree-of-Thoughts (Yao et al., 2023), and parallel sampling pipelines based on self-consistency (Wang et al., 2022), bootstrapped Best-of-N aggregation (Rakhsha et al., 2025), learned verifiers or reward models (Stiennon et al., 2020; Cobbe et al., 2021), RL-based parallel reasoning frameworks such as Parallel-R1 and Native Parallel Reasoner (Zheng et al., 2025; Wu et al., 2025), and width-oriented external parallel thinking such as ParaThinker (Wen et al., 2025). Despite their differences, these approaches coordinate reasoning primarily outside the model's native token-level generation process, and remain vulnerable to biased or correlated failures across independently sampled trajectories (Ichihara et al., 2025; Huang et al., 2023; Madaan et al., 2023).
#### Generation-Time Parallel Reasoning
More recent work has begun to move parallel reasoning into the generation process itself. ParaDecode OneSeq (Yu et al., 2025b) packs multiple branches into a single sequence for efficiency-oriented decoding. Hogwild! (Rodionov et al., 2025) enables explicit concurrent attention through shared cross-thread KV states, while GroupThink (Hsu et al., 2025) studies token-level collaboration among concurrent reasoning agents via direct cross-thread interaction. While LACE shares the goal of generation-time cross-thread interaction with these methods, it differs in mechanism. Rather than exposing each thread to token histories of other threads through shared KV states and non-standard masks, it introduces *implicit* cross-thread interaction through a lightweight gated side path over standard attention-derived representations. This preserves the standard causal-attention backbone while allowing threads to influence one another during generation, turning independent sampling into a collaborative process that explicitly targets redundant exploration and correlated errors (Kim et al., 2025; Hong & Page, 2004).
## 3 Method
We introduce our method LACE in three sections: Section 3.1 details the Lattice Attention mechanism, Section 3.2 outlines the training framework, and Section 3.3 describes the data curation pipeline. An overview of the entire training pipeline is illustrated in Figure 2.
Figure 2: Overview of Lattice Attention Layers. t denotes thread index, and l denotes token position. Lattice attention enables cross-thread communication by attending over the outputs of standard attention across different threads at aligned token positions. The resulting cross-thread context is fused back into the main path via a gated residual connection. This design allows each thread to benefit from the reasoning progress of its peers while preserving the original causal structure along the token axis.
### 3.1 Lattice Attention
#### Notation
Let B denote the batch size, N the number of threads, L the context length, and D the hidden dimension. We denote the head dimension as d and use subscript p for lattice attention components.
#### Overview
To enable information flow across different threads, we propose Lattice Attention, an additional attention mechanism that operates along the thread dimension, perpendicular to the token axis. A naive design that directly applies cross-thread attention over input embeddings faces both computational and parametric challenges: tokens at aligned positions across threads may not correlate strongly due to asynchronous reasoning progress, and extending the attention context would incur quadratic computational overhead. Moreover, training additional layers from scratch is prone to disturbing the well-learned causal attention layers, especially with limited data. These challenges drive us to propose a context-aware and parameter-efficient architecture.
Instead of attending over raw embeddings, we operate on the output of standard scaled dot-product attention (SDPA) $\mathbf{A}_{\text{std}} \in \mathbb{R}^{(BN) \times L \times D_a}$, which already encodes rich contextual information via causal attention. This design choice enables LACE to inherit the effective context length of standard attention while avoiding redundant computation.
To achieve parameter efficiency, we employ three key strategies. First, the SDPA output is projected to a lower-dimensional space via lightweight downsampling before cross-thread attention, significantly reducing the computation cost. Second, inspired by ControlNet (Zhang et al., 2023), we selectively plug lattice attention layers into the middle-to-last layers of the base model, where cross-thread communication is most beneficial for complex reasoning (Yang et al., 2024). Third, we adopt residual connections with learned gating to modulate the contribution of lattice attention, allowing the model to dynamically balance thread-independent and thread-aware processing. Together, these strategies limit additional parameters to less than 11% of the original model while enabling effective cross-thread information exchange with minimal disruption to the pre-trained causal layers.
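To make these three strategies concrete, the following is a minimal PyTorch sketch of one lattice attention layer. All module and argument names are ours, not the paper's, and the zero-initialized gate is an assumption consistent with the stated goal of minimal disruption to pre-trained layers. For brevity, cross-thread attention is restricted here to aligned token positions; the full mechanism, described next, attends over all threads and positions.

```python
import torch
import torch.nn as nn

class LatticeAttentionLayer(nn.Module):
    """Illustrative sketch of the gated side path over standard attention output.

    Input a_std has shape (B*N, L, D): the causal-attention output for N threads.
    d_lattice must be divisible by n_heads. Names are hypothetical.
    """

    def __init__(self, d_model: int, d_lattice: int, n_heads: int = 4):
        super().__init__()
        self.down = nn.Linear(d_model, d_lattice)  # lightweight downsampling
        self.cross = nn.MultiheadAttention(d_lattice, n_heads, batch_first=True)
        self.up = nn.Linear(d_lattice, d_model)
        # Zero-initialized gate (our assumption): the layer starts as an
        # identity residual, so pre-trained causal layers are undisturbed.
        self.gate = nn.Parameter(torch.zeros(d_model))

    def forward(self, a_std: torch.Tensor, n_threads: int) -> torch.Tensor:
        bn, L, _ = a_std.shape
        b = bn // n_threads
        z = self.down(a_std).reshape(b, n_threads, L, -1)
        # Group tokens at the same position across threads: (B*L, N, d_lattice).
        z = z.transpose(1, 2).reshape(b * L, n_threads, -1)
        ctx, _ = self.cross(z, z, z)  # each thread attends to its peers
        ctx = ctx.reshape(b, L, n_threads, -1).transpose(1, 2).reshape(bn, L, -1)
        # Gated residual fusion back into the main path.
        return a_std + torch.tanh(self.gate) * self.up(ctx)
```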
#### Cross-Thread Attention
We project the standard attention output and compute parallel QKV:
$$\mathbf{Z} = \mathbf{A}_{\text{std}} \mathbf{W}_{\text{down},p}$$
$$\mathbf{Q}_p, \mathbf{K}_p, \mathbf{V}_p = \text{Proj}(\mathbf{Z})$$
To encode both token and thread indices, we apply 3D RoPE (Su et al., 2024; Ma et al., 2025) which splits the head dimension: the first $d_t$ dimensions encode token position $t$, and the remaining $d_b = d - d_t$ dimensions encode block index $n$:
$$\tilde{\mathbf{Q}}_p, \tilde{\mathbf{K}}_p = \text{RoPE}_{3D}(\mathbf{Q}_p, \mathbf{K}_p; t, n)$$
We reshape from $(BN, L, \cdot)$ to $(B, NL, \cdot)$ to enable cross-block attention...
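Under the shapes defined above, the cross-thread attention step might look as follows (a single-head sketch; `q`, `k`, `v` stand in for the outputs of $\text{Proj}(\mathbf{Z})$). The RoPE base frequency, the exact rotation scheme, and the lattice-causal mask are our assumptions; the paper specifies only that the first $d_t$ head dimensions encode token position $t$, the remaining $d_b$ encode the block index $n$, and that tensors are reshaped from $(BN, L, \cdot)$ to $(B, NL, \cdot)$.

```python
import torch

def rope_angles(index: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotation angles for a 1-D integer index over `dim` (even) rotary channels."""
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return index[:, None].float() * freqs  # (len(index), dim/2)

def apply_rotary(x: torch.Tensor, ang: torch.Tensor) -> torch.Tensor:
    """Rotate consecutive channel pairs of x by the given angles (standard RoPE)."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = ang.cos(), ang.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: torch.Tensor, token_pos: torch.Tensor,
            thread_idx: torch.Tensor, d_t: int) -> torch.Tensor:
    """Split head dims: the first d_t channels encode token position t,
    the remaining d - d_t channels encode the thread/block index n."""
    return torch.cat(
        [apply_rotary(x[..., :d_t], rope_angles(token_pos, d_t)),
         apply_rotary(x[..., d_t:], rope_angles(thread_idx, x.shape[-1] - d_t))],
        dim=-1)

def cross_thread_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor,
                           n_threads: int, d_t: int) -> torch.Tensor:
    """q, k, v: (B*N, L, d). Returns cross-thread context of the same shape."""
    bn, L, d = q.shape
    b = bn // n_threads
    # Flattened index is n*L + t: token position and thread index per position.
    pos = torch.arange(L).repeat(n_threads)
    tid = torch.arange(n_threads).repeat_interleave(L)
    # Reshape (B*N, L, d) -> (B, N*L, d) so all threads share one context.
    q = rope_3d(q.reshape(b, n_threads * L, d), pos, tid, d_t)
    k = rope_3d(k.reshape(b, n_threads * L, d), pos, tid, d_t)
    v = v.reshape(b, n_threads * L, d)
    # Lattice-causal mask (assumed): attend to any thread's tokens at the same
    # or earlier token positions, preserving causality along the token axis.
    allowed = pos[None, :] <= pos[:, None]  # (N*L, N*L)
    scores = (q @ k.transpose(-1, -2)) / d ** 0.5
    scores = scores.masked_fill(~allowed, float("-inf"))
    return (scores.softmax(dim=-1) @ v).reshape(bn, L, d)
```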