Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers

arXiv cs.CL 05/26/26, 04:00 AM Papers

attention-mechanism transformers efficiency interpretability sparsity grammar-guided nlp

Summary

This paper introduces a grammatically-guided sparse attention mechanism for Transformers, aiming to improve efficiency and interpretability by leveraging linguistic structure.

arXiv:2605.24518v1 Announce Type: new Abstract: The quadratic complexity of self-attention in Transformer models remains a significant bottleneck for processing long sequences and deploying large language models efficiently. Sparse attention has been studied extensively as a remedy, and DeepSeek Sparse Attention combines several methods of grouping tokens into segments to reduce time complexity. This paper introduces a novel approach, Grammatically-Guided Sparse Attention, which constrains attention computations based on the grammatical roles of tokens. Using Part-of-Speech (POS) tags, the method dynamically generates attention masks that enforce linguistically coherent connections between tokens, shrinking the computational graph without sacrificing essential linguistic dependencies. Two masking strategies are proposed and evaluated: a hard mask that strictly allows only predefined grammatical interactions, and a soft mask that biases attention towards these interactions. Experiments on the SST-2 sentiment classification task, using a DistilBERT-like architecture, show that Grammatically-Guided Sparse Attention maintains accuracy comparable to full attention while significantly reducing the theoretical computational overhead. Preliminary results show accuracies of 0.8200 for hard masking and 0.8165 for soft masking, compared with 0.8200 for full attention, suggesting a path towards more efficient, interpretable, and linguistically-informed Transformer architectures.
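The hard and soft masking strategies described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the paper's rule set or implementation, so everything here is an assumption made for illustration: the `ALLOWED_PAIRS` table, the helper names `build_mask` and `masked_attention`, the choice to always permit self-attention, and the `penalty` value used for the soft mask. Plain Python is used so the example runs without dependencies.

```python
import math

# Hypothetical rule table: (query POS, key POS) pairs allowed to attend.
# These pairs are illustrative only; the paper's real rules are not in the abstract.
ALLOWED_PAIRS = {
    ("DET", "NOUN"), ("ADJ", "NOUN"), ("NOUN", "VERB"),
    ("VERB", "NOUN"), ("ADV", "VERB"), ("ADV", "ADJ"),
}

def build_mask(pos_tags):
    """Boolean n x n mask; self-attention is always allowed (an assumption)
    so that no row is ever fully masked."""
    n = len(pos_tags)
    return [[i == j or (pos_tags[i], pos_tags[j]) in ALLOWED_PAIRS
             for j in range(n)] for i in range(n)]

def softmax(row):
    m = max(row)  # finite, because the diagonal is never masked
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def masked_attention(scores, mask, mode="hard", penalty=2.0):
    """Row-wise attention weights from raw scores.
    hard: disallowed pairs get -inf, so their weight is exactly zero.
    soft: disallowed pairs are penalised by a fixed amount (assumed value)."""
    weights = []
    for score_row, mask_row in zip(scores, mask):
        if mode == "hard":
            adjusted = [s if ok else -math.inf
                        for s, ok in zip(score_row, mask_row)]
        else:
            adjusted = [s if ok else s - penalty
                        for s, ok in zip(score_row, mask_row)]
        weights.append(softmax(adjusted))
    return weights

# Toy input: four tokens with uniform raw scores.
tags = ["DET", "ADJ", "NOUN", "VERB"]
scores = [[0.0] * len(tags) for _ in tags]
mask = build_mask(tags)

hard = masked_attention(scores, mask, mode="hard")
soft = masked_attention(scores, mask, mode="soft")

# Fraction of query-key pairs that survive the hard mask.
density = sum(sum(row) for row in mask) / len(tags) ** 2
```

In this toy setup, the hard mask removes disallowed pairs entirely, while the soft mask leaves them a small nonzero weight. The `density` value gives a rough sense of the theoretical saving: only the surviving pairs would need to be computed by a sparse kernel, whereas a dense implementation that merely applies the mask still does the full quadratic work.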

Cached at: 05/26/26, 09:03 AM

# Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers
Source: [https://arxiv.org/abs/2605.24518](https://arxiv.org/abs/2605.24518)

Similar Articles

Generative modeling with sparse transformers

@akshay_pachaar: 1) Sparse Attention It limits the attention computation to a subset of tokens by: - Using local attention (tokens atten…

GiLT: Augmenting Transformer Language Models with Dependency Graphs

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Adaptive Computation Depth via Learned Token Routing in Transformers
