Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation

arXiv cs.CL Papers

Summary

This paper introduces SemanticSeg, a large-scale dataset for semantic segmentation of long texts, and block distillation, a training framework that enables block attention models to approach full-attention performance, improving KV cache reuse in RAG and long-context scenarios.

arXiv:2605.15913v1 Announce Type: new Abstract: Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories (including books, code, web text, and conversations), with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

# Towards Generalization of Block Attention via Automatic Segmentation and Block Distillation
Source: [https://arxiv.org/html/2605.15913](https://arxiv.org/html/2605.15913)
Shuaiyi Li¹, Zhisong Zhang², Yan Wang³, Lei Zhu³, Dongyang Ma³, Chenlong Deng⁵, Yang Deng⁴, Wai Lam¹†

{sli, wlam}@se.cuhk.edu.hk, zhisong.zhang@cityu.edu.hk

¹The Chinese University of Hong Kong, ²City University of Hong Kong, ³Tencent, ⁴Singapore Management University, ⁵Gaoling School of Artificial Intelligence, Renmin University of China

###### Abstract

Block attention, which processes the input as separate blocks that cannot attend to one another, offers significant potential to improve KV cache reuse in long-context scenarios such as Retrieval-Augmented Generation (RAG). However, its broader application is hindered by two key challenges: the difficulty of segmenting input text into meaningful, self-contained blocks, and the inefficiency of existing block fine-tuning methods that risk degrading performance. To address these, we first construct SemanticSeg, a large and diverse semantic segmentation dataset containing over 30k instances across 16 categories (including books, code, web text, and conversations), with text lengths ranging from 2k to 32k. Using this dataset, we train a lightweight segmenter to automatically partition text into human-instinct-aligned blocks with controllable granularity. Second, we propose block distillation, a training framework that is more efficient than block fine-tuning, which uses a frozen full-attention teacher model to guide the block-attention student. This framework integrates three novel components: block sink tokens to mitigate information loss at block boundaries, block dropout to leverage training signals from all blocks, and token-level loss weighting to focus learning on block-attention-sensitive tokens. Experiments across multiple models and benchmarks demonstrate that our segmenter outperforms heuristic and statistical baselines, and block distillation achieves near-full-attention performance under block attention, establishing a practical and scalable pathway for deploying block attention.

## 1 Introduction

Large language models (LLMs) have demonstrated remarkable capabilities in processing long-context inputs \[Bai et al., [2025](https://arxiv.org/html/2605.15913#bib.bib22), [2024](https://arxiv.org/html/2605.15913#bib.bib23), Maharana et al., [2024](https://arxiv.org/html/2605.15913#bib.bib6)\], enabling applications such as multi-document question answering and coding. However, the standard full-attention mechanism scales quadratically with sequence length, making inference on long inputs computationally expensive and memory-intensive. A significant source of this inefficiency is the context-dependent nature of full attention: when identical context is paired with different prefixes, its key-value (KV) states must be recomputed from scratch. This leads to substantial waste of compute and energy, particularly in retrieval-augmented generation (RAG) scenarios where overlapping document sets are repeatedly processed across queries. To mitigate this, block attention \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] has emerged as a promising alternative. By restricting self-attention to independent blocks and allowing only a final aggregation block to attend globally, it eliminates cross-block dependencies and facilitates the reuse of pre-computed KV caches. Nevertheless, its practical adoption is hindered by several obstacles.

An important barrier to applying block attention is segmentation, that is, how to divide the input sequence into separate blocks. Existing approaches often rely on heuristic rules, such as splitting at newlines; however, such rules rarely generalize and are highly likely to break the semantic coherence of the inputs. As demonstrated in Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2.SSS0.Px1), such naive segmentation leads to performance degradation, highlighting the need for a semantic-aware approach. To address this, we take a data-driven approach: we construct a segmentation dataset (SemanticSeg), in which each sample is segmented according to its semantic meaning, and use it to train a neural segmenter capable of handling diverse input formats. The segmenter automatically produces adaptive and context-aware boundaries, overcoming a major obstacle to the generalization of block attention.

Another major challenge is effectively integrating block attention into existing LLMs. While training-free strategies such as Prompt Cache \[Gim et al., [2024](https://arxiv.org/html/2605.15913#bib.bib4)\] and Superposition prompting \[Merth et al., [2024](https://arxiv.org/html/2605.15913#bib.bib5)\] attempt to enable KV state reuse or parallel processing, applying block attention directly to general domains in this way causes severe performance degradation compared to full attention (Table [3](https://arxiv.org/html/2605.15913#S5.T3)). This indicates that these approaches are not sufficient for reliable block attention, further highlighting the necessity of specialized training. \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] addressed this through "block fine-tuning", which trains models on both block and full attention patterns to balance specialized performance with general capability. However, this approach is computationally expensive and generalizes poorly across diverse domains (Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2.SSS0.Px2)). To address these limitations, we introduce Block Distillation, a training framework designed for higher efficiency and signal density. It incorporates three novel mechanisms: *block sink tokens* to counteract information loss at block boundaries, *block dropout* to exploit training signals from all blocks, and *token-level loss weighting* to emphasize tokens that are most sensitive to block attention.

To validate our approach, we conduct comprehensive experiments across multiple models and benchmarks. To verify the effectiveness of our segmenter, we quantify the impact of segmentation on downstream performance and compare our segmenter against a range of heuristic and statistical segmentation baselines. To evaluate our training framework, we assess Block Distillation in terms of its efficiency, its performance in general domains, and the specific contributions of its individual components. Experimental results demonstrate that our segmenter consistently outperforms all baselines, and Block Distillation pushes block-attention performance close to the full-attention upper bound while preserving or even improving full-attention capability, establishing a practical and scalable pathway for deploying block attention in long-context applications (footnote 1: [https://github.com/Syon-Li/Generalization-of-Block-Attention/tree/main](https://github.com/Syon-Li/Generalization-of-Block-Attention/tree/main)).

## 2 Preliminary

We illustrate the idea of block attention with the example of RAG. Consider a pool of $r$ documents and two instances with overlapping retrieved contexts:

> Inst 1: Document \[$i$\]; Document \[$i+1$\]; …; Document \[$i+x$\]; …; \[Query $q_x$\].
>
> Inst 2: Document \[$j$\]; Document \[$j+1$\]; …; Document \[$j+y$\]; …; \[Query $q_y$\].

where the document sets intersect at $\{i,\dots,i+x\} \cap \{j,\dots,j+y\} = \{n,\dots,m\}$. In standard full attention, the KV states for this intersection must be recomputed for each query because the attention mechanism is prefix-dependent, resulting in significant computational overhead and energy waste. However, if encoding can be performed independently for each document, the KV states for the intersection $\{n,\dots,m\}$ become prefix-agnostic and can be safely reused across disparate queries. Block attention \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] formalizes this by partitioning the input into independent blocks. Each block employs self-attention restricted to its own tokens, ensuring that internal representations are decoupled from other blocks. Only the final block (typically the user query) is permitted to utilize full attention, aggregating information from all preceding KV caches.
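The reuse argument above can be made concrete with a toy sketch. It only counts how many documents need encoding when each one is encoded independently and cached by identity; `encode_block` is a hypothetical stand-in for computing a document's KV states, and the document ids are made up for illustration.

```python
from typing import Dict, List, Tuple

def encode_block(doc_id: int) -> Tuple[str, int]:
    """Hypothetical stand-in for computing one document's KV states in isolation."""
    return ("kv", doc_id)

def prefill_with_reuse(doc_ids: List[int], cache: Dict[int, Tuple[str, int]]) -> int:
    """Encode each document independently, reusing cached KV states.

    Returns the number of documents that actually had to be encoded.
    """
    encoded = 0
    for doc_id in doc_ids:
        if doc_id not in cache:
            cache[doc_id] = encode_block(doc_id)
            encoded += 1
    return encoded

cache: Dict[int, Tuple[str, int]] = {}
inst1 = [3, 4, 5, 6]  # documents i .. i+x (illustrative ids)
inst2 = [5, 6, 7]     # documents j .. j+y, overlapping on {5, 6}
first = prefill_with_reuse(inst1, cache)
second = prefill_with_reuse(inst2, cache)
```

In this toy run the second instance only encodes the one document outside the intersection. Under full attention the cached states would not be valid for a different prefix, so all three documents would need recomputation.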

This method can be implemented in three steps: 1) independently encoding each block except the last one; 2) computing the positional encoding of each token based on its position in the input text; 3) concatenating the pre-computed KV states of all blocks and using them to compute the KV states of the final block. However, previous work \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] has shown that a model cannot accommodate this pattern without training, which we also verify in this work (Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2.SSS0.Px1)). To cope with this challenge, they employ block fine-tuning, which updates the parameters under both attention patterns, one pass with block attention and one with full attention. They claim that this enhances block-attention performance while maintaining full-attention capability. However, despite its heavy updating scheme, it struggles to generalize to other domains (Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2)).
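The attention pattern described in this section can be sketched as a boolean mask. This is a minimal illustration, not the authors' implementation: it assumes causal attention inside every block, and the function name `block_attention_mask` is hypothetical. Row and column indices are global token positions, matching step 2.

```python
from typing import List

def block_attention_mask(block_lengths: List[int]) -> List[List[bool]]:
    """Return mask[q][k], True if query token q may attend to key token k.

    Every block except the last is causal within itself only; the last
    block is causal over the whole sequence.
    """
    total = sum(block_lengths)
    # block_id[t] is the index of the block that token t belongs to.
    block_id = [b for b, n in enumerate(block_lengths) for _ in range(n)]
    last = len(block_lengths) - 1
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(q + 1):  # causal: only keys at or before the query
            if block_id[q] == last or block_id[q] == block_id[k]:
                mask[q][k] = True
    return mask

# Three blocks of lengths 2, 2, and 1; the final block plays the role of the query.
mask = block_attention_mask([2, 2, 1])
```

Because no row of a non-final block depends on tokens outside that block, its KV states can be computed once and concatenated later for the final block.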

## 3 Automatic Segmentation

A major barrier to the general application of block attention is segmentation, i.e., given the input text, how to cut it into meaningful, self-contained, and human-instinct-aligned blocks (or chunks) for later processing. One may argue that segmentation has a limited effect on the final performance, as the model may not understand semantics in the same way as humans. However, as shown in Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2.SSS0.Px1), segmentation plays an important role in the final performance.

To enable general automatic segmentation, we adopt a data-driven approach and first construct a semantic segmentation dataset, called SemanticSeg. We then design the segmenter and its processing approach, and train it on this dataset.

### 3.1 SemanticSeg

We diversify the sources and lengths ($l \in [2k, 32k]$) of the dataset to facilitate the generalization of the segmenter. For each input text, we follow step 1 in Fig. [1](https://arxiv.org/html/2605.15913#S3.F1) to insert candidate cut tokens and use Gemini-2.5-Pro to determine the final segmentation results. SemanticSeg contains 16 categories of segmentation data, each with roughly 2k instances or more. The varying cut rates across categories also help the segmenter learn distinct segmentation patterns. More details of the dataset can be found in Appendix [B](https://arxiv.org/html/2605.15913#A2).

![Refer to caption](https://arxiv.org/html/2605.15913v1/x1.png)

Figure 1: The segmentation process. 1. The candidate cut tokens are first inserted into the raw text via a simple rule (newline in the example). 2. The initial text segments are fed into the segmenter, which outputs a binary probability distribution for each candidate cut token. 3. The segmenter can be applied recursively with different division thresholds to customize the segmentation granularity. 4. The segmentation for each candidate cut token is determined by the corresponding consecutive cut token.

### 3.2 Segmentation

We construct the segmenter from a pre-trained language model backbone (Qwen3-4B-Instruct-2507) by adding a classification head consisting of two linear layers with an intermediate ReLU activation. The segmentation process is presented in Fig. [1](https://arxiv.org/html/2605.15913#S3.F1). For an input text $T$, a set of candidate cut tokens $C = \{C_0, C_1, \dots, C_n\}$ is first inserted into the input text via simple rules such as splitting at the newline character. This forms a series of initial text segments $\{C_0, T_1, C_1, T_2, C_2, \dots, T_n, C_n\}$, where $\{T_1, T_2, \dots, T_n\} = T$. The segmenter then takes this series of text segments, along with the candidate cut tokens, as input and outputs a binary probability distribution for each candidate cut token. Because a candidate cut token cannot attend to the text segment that follows it (for example, in Fig. [1](https://arxiv.org/html/2605.15913#S3.F1), `<cut 3>` cannot attend to the following token "Method", which is a promising segmentation point), we use the hidden vector of the next candidate to determine the segmentation of the current candidate.
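The "decide from the next candidate" rule can be sketched as a simple index shift. This is an illustration under stated assumptions rather than the paper's code: `probs_at_candidates[k]` is taken to be the cut probability produced from the hidden vector at candidate $C_k$, the comparison `>=` is an assumed convention, and treating the final candidate (which has no successor) as a boundary is also an assumption.

```python
from typing import List

def decide_cuts(probs_at_candidates: List[float], threshold: float = 0.5) -> List[bool]:
    """Return one cut decision per candidate C_0 .. C_n.

    The decision for C_k is read from the output at C_{k+1}, whose hidden
    vector has already seen the text segment that follows C_k.
    """
    if not probs_at_candidates:
        return []
    shifted = probs_at_candidates[1:]  # output at C_{k+1} decides C_k
    decisions = [p >= threshold for p in shifted]
    decisions.append(True)  # assumption: the last candidate closes the text
    return decisions

decisions = decide_cuts([0.9, 0.1, 0.8, 0.3], threshold=0.5)
```

Note that the value at index 0 is never used here, since no earlier candidate exists for it to decide.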

An important factor in segmentation is granularity, which determines the final number of blocks (the parallel degree) and how much information each block contains. The segmenter offers two adjustable controls for granularity: the threshold applied to the binary probability distribution, and the recursion depth (the number of times the segmenter is applied recursively, with each level splitting existing blocks further using a pre-defined threshold). Users can pair each recursion level with a different threshold value. Generally, a deeper recursion level can be paired with a greater or equal threshold value. During training, the threshold is set to 0.5, but we recommend 0.2 to 0.5 for the first level of recursion.
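The interaction of threshold and recursion depth can be sketched as follows. This is a minimal illustration: `segment` and `toy_score` are hypothetical names, and `toy_score` is a made-up, context-dependent stand-in for the trained segmenter (it favours the top-most markdown heading level present in its input), used only so the example runs.

```python
from typing import Callable, List, Sequence

ScoreFn = Callable[[List[str]], List[float]]

def segment(pieces: List[str], score_fn: ScoreFn, thresholds: Sequence[float]) -> List[List[str]]:
    """Recursively group initial text pieces into blocks.

    score_fn(pieces) returns one cut probability per internal boundary
    (len(pieces) - 1 values); thresholds[d] is applied at recursion depth d.
    """
    if not thresholds or len(pieces) <= 1:
        return [pieces]
    probs = score_fn(pieces)
    blocks, current = [], [pieces[0]]
    for piece, p in zip(pieces[1:], probs):
        if p >= thresholds[0]:
            blocks.append(current)  # close the running block at this boundary
            current = [piece]
        else:
            current.append(piece)
    blocks.append(current)
    result: List[List[str]] = []
    for block in blocks:  # re-score each block on its own at the next depth
        result.extend(segment(block, score_fn, thresholds[1:]))
    return result

def heading_level(piece: str) -> int:
    """Number of leading '#' characters; 0 means plain text."""
    return len(piece) - len(piece.lstrip("#"))

def toy_score(pieces: List[str]) -> List[float]:
    """Toy scorer: high probability before the top-most heading level in view."""
    levels = [heading_level(p) for p in pieces[1:]]
    headings = [lv for lv in levels if lv > 0]
    if not headings:
        return [0.1] * len(levels)
    top = min(headings)
    return [0.9 if lv == top else 0.1 for lv in levels]

pieces = ["# A", "## a1", "text", "## a2", "# B", "text b"]
coarse = segment(pieces, toy_score, [0.5])
fine = segment(pieces, toy_score, [0.5, 0.5])
```

With one level the toy text splits only at top-level headings; a second level with an equal threshold splits each block again at its own sub-headings, which is the sense in which recursion refines granularity.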

## 4 Block Distillation

Block fine-tuning \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] trains the model under both block and full attention to preserve full-attention performance. This is time-consuming and hard to scale. Therefore, we propose block distillation, which uses the original full-attention model to guide block-attention training. With block distillation, we can bypass the heavy updating scheme of block fine-tuning, improving block-attention performance without degrading full-attention performance (Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2)).

Block distillation employs three novel components to facilitate training: *block sink tokens*, which mitigate abnormal attention patterns in block attention; *block dropout*, which exploits the training signal from non-final blocks; and *token weighting*, which applies per-token weights to the cross-entropy loss.

### 4.1 Block Sink Tokens

The previous investigation \[Zhang et al., [2025](https://arxiv.org/html/2605.15913#bib.bib11)\] reveals that attention patterns are extremely abnormal at the beginning of each block, leading to potential optimization instabilities. We characterize this challenge as *lost in block head* (Section [5.3](https://arxiv.org/html/2605.15913#S5.SS3.SSS0.Px1)). To tackle this problem, we introduce a new token (`<|block_start|>`) called the block sink token. Following \[Xiao et al., [2024](https://arxiv.org/html/2605.15913#bib.bib10)\], we duplicate the block sink token four times at the beginning of each block. Hence, the final block-attention input follows the format $\{bls \times 4, B_1, bls \times 4, B_2, \dots, bls \times 4, B_n\}$, where $bls$ denotes the block sink token `<|block_start|>` and $B_r, r \in [1, n]$ are the blocks partitioned by the segmenter. In practice, we set the block dropout rate (Section 4.2) to around 0.6 throughout training.
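The input format above amounts to prefixing every block with four sink tokens. A minimal sketch at the level of token strings follows; the helper name `add_block_sinks` is hypothetical, and in a real tokenizer the sink token would have to be registered as a special token first.

```python
from typing import List

BLOCK_SINK = "<|block_start|>"
NUM_SINKS = 4  # the section duplicates the sink token four times per block

def add_block_sinks(blocks: List[List[str]]) -> List[str]:
    """Flatten blocks into one token list, prefixing each block with sink tokens."""
    out: List[str] = []
    for block in blocks:
        out.extend([BLOCK_SINK] * NUM_SINKS)
        out.extend(block)
    return out

tokens = add_block_sinks([["a", "b"], ["c"]])
```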

![Refer to caption](https://arxiv.org/html/2605.15913v1/x2.png)

Figure 2: The block dropout. A number of randomly selected blocks are forced to attend only to the content within the block itself. Note that the final block always follows the full-attention pattern.

### 4.2 Block Dropout

A fundamental requirement for block attention is the model's ability to accurately retrieve information from the KV caches of all the blocks. Existing fine-tuning methods \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] are highly inefficient because they only optimize the model using signals from the final block, essentially ignoring the vast majority of the input sequence. To address this signal sparsity problem, we introduce block dropout (Fig. [2](https://arxiv.org/html/2605.15913#S4.F2)). This mechanism randomly selects a subset of context blocks for individual encoding (the corrupted blocks, blue in Fig. [2](https://arxiv.org/html/2605.15913#S4.F2)) and applies a KL divergence loss to all remaining non-corrupted blocks (orange). By doing so, we force the model to learn from a much larger proportion of the text. Formally, let $x$ be an input sequence, $\varphi$ a frozen teacher model, $\varphi_s$ a student model, and $\mathfrak{R}(x)$ the set of tokens within corrupted blocks. The block dropout KL divergence is defined as:

$$KL_x = D_{KL}\big(p_{\varphi}(\{x_i \mid x_i \notin \mathfrak{R}(x)\}) \,\|\, p_{\varphi_s}(\{x_i \mid x_i \notin \mathfrak{R}(x)\})\big) \tag{1}$$
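Block dropout and Eq. (1) can be sketched on toy data. Several details below are assumptions, not taken from the paper: blocks are dropped independently with probability `rate`, attention is causal throughout, and the KL term is averaged over the kept tokens. All function names are hypothetical.

```python
import math
import random
from typing import List, Set

def sample_dropped_blocks(num_blocks: int, rate: float, rng: random.Random) -> Set[int]:
    """Choose context blocks to encode in isolation; the final block is never chosen."""
    return {b for b in range(num_blocks - 1) if rng.random() < rate}

def block_ids(block_lengths: List[int]) -> List[int]:
    """Map every token position to the index of its block."""
    return [b for b, n in enumerate(block_lengths) for _ in range(n)]

def dropout_mask(block_lengths: List[int], dropped: Set[int]) -> List[List[bool]]:
    """mask[q][k] is True if query q may attend to key k.

    Tokens in dropped blocks see only their own block; all others see
    every earlier token.
    """
    ids = block_ids(block_lengths)
    return [
        [k <= q and (ids[q] not in dropped or ids[k] == ids[q]) for k in range(len(ids))]
        for q in range(len(ids))
    ]

def kl(p: List[float], q: List[float]) -> float:
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)

def block_dropout_kl(teacher: List[List[float]], student: List[List[float]],
                     ids: List[int], dropped: Set[int]) -> float:
    """Mean KL(teacher || student) over tokens outside the dropped blocks."""
    kept = [t for t, b in enumerate(ids) if b not in dropped]
    if not kept:
        return 0.0
    return sum(kl(teacher[t], student[t]) for t in kept) / len(kept)

lengths = [2, 2, 1]
dropped = {1}  # block 1 is encoded in isolation
mask = dropout_mask(lengths, dropped)
teacher = [[0.5, 0.5]] * 5
student = [[0.5, 0.5], [0.5, 0.5], [0.9, 0.1], [0.5, 0.5], [0.5, 0.5]]
loss_masked = block_dropout_kl(teacher, student, block_ids(lengths), dropped)
loss_all = block_dropout_kl(teacher, student, block_ids(lengths), set())
```

In the toy data the student only disagrees with the teacher inside the dropped block, so the masked KL is zero while the unmasked one is not. This shows that tokens in $\mathfrak{R}(x)$ are excluded from the term.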

### 4.3 Token Weighting

Traditional cross-entropy loss weights all tokens equally. This potentially decreases training effectiveness, as it carries no information about the differing importance of each token. Thus, we introduce token weights that are applied along the token dimension of the cross-entropy computation, similar to \[Li et al., [2025](https://arxiv.org/html/2605.15913#bib.bib9)\]. Specifically, given a teacher model $\varphi$ whose weights are frozen and an input $x$, the token weights are computed as follows:

$$w_x = \max\big(CE(\varphi_b(x)) - CE(\varphi(x)),\, 0\big) \times \alpha + \beta \tag{2}$$

where $\varphi_b$ denotes a block-attention forward pass in the teacher model and $CE$ is the cross-entropy loss. The token weights assign greater weight to tokens with a relatively large loss difference between the block-attention and full-attention forward passes, and shrink the loss scale of insensitive tokens ($CE(\varphi_b(x)) - CE(\varphi(x)) \leq 0$) to $\beta$ (usually a value near 0.1). In this way, we can alleviate the noise in training and focus more on learning the block-attention capability. We set $\alpha = 0.2, \beta = 0.1$ for Qwen series models and $\alpha = 0.5, \beta = 0.1$ for Llama series models in the later experiments.
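Eq. (2) is simple enough to state directly in code. The sketch below takes per-token cross-entropy values from the two teacher forward passes as plain lists; the function name and the example numbers are illustrative only.

```python
from typing import List

def token_weights(ce_block: List[float], ce_full: List[float],
                  alpha: float, beta: float) -> List[float]:
    """Per-token weights w = max(CE_block - CE_full, 0) * alpha + beta.

    ce_block: teacher cross-entropy per token under block attention.
    ce_full:  teacher cross-entropy per token under full attention.
    """
    return [max(b - f, 0.0) * alpha + beta for b, f in zip(ce_block, ce_full)]

# alpha and beta follow the Qwen setting quoted in the text.
weights = token_weights([2.0, 1.0, 0.5], [1.0, 1.0, 1.0], alpha=0.2, beta=0.1)
```

Only the first token is harder under block attention than under full attention, so it is the only one whose weight rises above the floor $\beta$.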

### 4.4 Training

The final training loss is the combination of the previously introduced components. Specifically,

$$loss_x = CE(\varphi_{bs}(x)) \times w_x + KL_x \tag{3}$$

where $\varphi_{bs}$ represents a block-attention forward pass in the student model. More training details are given in Appendix [C.2](https://arxiv.org/html/2605.15913#A3.SS2).
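A sketch of how Eq. (3) might combine the pieces for one sequence. The mean reduction over tokens is an assumption (the equation does not specify a reduction), and the function name is hypothetical.

```python
from typing import List

def block_distillation_loss(student_ce_block: List[float], weights: List[float],
                            kl_x: float) -> float:
    """Weighted student cross-entropy (mean over tokens, assumed) plus the KL term.

    student_ce_block: per-token CE of the student under block attention.
    weights:          per-token weights from Eq. (2).
    kl_x:             block dropout KL term from Eq. (1).
    """
    weighted = [ce * w for ce, w in zip(student_ce_block, weights)]
    return sum(weighted) / len(weighted) + kl_x

loss = block_distillation_loss([2.0, 1.0], [0.3, 0.1], kl_x=0.05)
```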

## 5 Experiments

In this section, we conduct comprehensive experiments to verify the key arguments of this paper from two perspectives. From the perspective of segmentation: 1) the degree to which segmentation influences downstream performance, and 2) whether our segmenter offers a genuine improvement over other straightforward partition methods. From the perspective of block distillation: 1) whether block fine-tuning is enough for general-domain application, 2) whether block distillation can help block-attention performance approach that of full attention in the general domain, 3) whether block distillation affects full-attention performance, 4) what efficiency gain block attention brings at inference, and 5) how effective each component of block distillation is.

### 5.1 Experiment Settings

#### Benchmarks

We adopt two popular comprehensive benchmarks for evaluation, namely LongBench \[Bai et al., [2024](https://arxiv.org/html/2605.15913#bib.bib23)\] and LoCoMo \[Maharana et al., [2024](https://arxiv.org/html/2605.15913#bib.bib6)\], and follow the exact procedures defined in the original papers, including the metrics and prompts.

#### Baselines

Unlike \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\], which needs to reproduce the whole SFT procedure, we implement block distillation directly on chat models, including Qwen3-4B-Instruct-2507, Llama-3.1-8B-Instruct, Qwen3-8B, and Qwen3-14B. For fair comparison, we also include the model from previous work \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\]. The details of the baselines are as follows:

- Original - The untouched official model released on HuggingFace. This is the performance upper bound for the block attention model. Our objective is to make the general performance of the block attention model approximate that of this model as closely as possible.
- Block-Dist - The model trained via our block distillation framework using our segmented data.
- Prompt Cache - The Prompt Cache \[Gim et al., [2024](https://arxiv.org/html/2605.15913#bib.bib4)\] baseline, which enables training-free reuse of the attention states across prompts.
- Superposition - The Superposition Prompting baseline \[Merth et al., [2024](https://arxiv.org/html/2605.15913#bib.bib5)\]. It allows LLMs to process the input documents in parallel paths, and prunes the irrelevant paths at the end.
- Tulu3-SFT - The original Llama-3.1-Tulu-3-8B-SFT model, which serves as the ceiling performance for the Tulu3 series of block attention baselines from \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\].
- Tulu3-Block-FT - The block attention model trained by block fine-tuning \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\], using the SFT dataset of Tulu3 and 20,000 samples of RAG data sampled from TriviaQA and 2WikiMultiHopQA. We include this model to visualize the gap between its block-attention performance and the full-attention performance of Tulu3-SFT in the general domain.
- Tulu3-Block-FT-S - Since the training data of Tulu3-Block-FT is partitioned by simple rules without the segmenter, we further train Tulu3-Block-FT using our data divided by the segmenter for fair comparison.

Unless otherwise specified, "\- Full" indicates that the test is under full attention, and "\- Block" means evaluation is using block attention\.

Table 1: Results for different segmentation methods on LongBench \[Bai et al., [2024](https://arxiv.org/html/2605.15913#bib.bib23)\]. For fair comparison, the parallel degree for all methods is aligned with that of the segmenter.

Table 2: Main results on Block-FT. "- Full" means the evaluation is performed using full attention, and "- Block" means the evaluation is performed under block attention.

Table 3: Main results on LongBench \[Bai et al., [2024](https://arxiv.org/html/2605.15913#bib.bib23)\].

Table 4: Main results on LoCoMo \[Maharana et al., [2024](https://arxiv.org/html/2605.15913#bib.bib6)\].

### 5.2 Main Results

#### The impact of segmentation

In this section, we verify two points: how much segmentation affects performance, and whether the segmenter is really better than simple segmentation baselines. Both are verified by comparing the segmenter with other segmentation methods (Table [1](https://arxiv.org/html/2605.15913#S5.T1)). Specifically, we include two sets of baselines. The first set consists of heuristic segmentation methods:

- Random - Random segmentation, with spaces as the set of candidate cut points.
- Average - Evenly spaced segmentation, with spaces as the set of candidate cut points.
- Punctuation - Segmentation with sentence-ending punctuation as the set of candidate cut points.
- Random candidate - Random segmentation, with the set of candidate cut points taken from step 2 of Fig. [1](https://arxiv.org/html/2605.15913#S3.F1).
- Average candidate - Evenly spaced segmentation, with the set of candidate cut points taken from step 2 of Fig. [1](https://arxiv.org/html/2605.15913#S3.F1).
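The heuristic baselines above might look like the following sketch. The exact procedures are not specified in the text, so this is only one plausible reading: candidates are character offsets, "average" spreads cuts evenly over the candidate list (not over token counts), and all function names are hypothetical.

```python
import random
from typing import List

def space_candidates(text: str) -> List[int]:
    """Character offsets of spaces, used as candidate cut points."""
    return [i for i, ch in enumerate(text) if ch == " "]

def random_cuts(candidates: List[int], num_blocks: int, rng: random.Random) -> List[int]:
    """Pick num_blocks - 1 distinct cut points uniformly at random."""
    n_cuts = min(num_blocks - 1, len(candidates))
    if n_cuts <= 0:
        return []
    return sorted(rng.sample(candidates, n_cuts))

def average_cuts(candidates: List[int], num_blocks: int) -> List[int]:
    """Pick num_blocks - 1 cut points spread evenly over the candidate list."""
    n_cuts = min(num_blocks - 1, len(candidates))
    if n_cuts <= 0:
        return []
    step = len(candidates) / (n_cuts + 1)
    return [candidates[int(step * (i + 1))] for i in range(n_cuts)]

text = "a b c d e f g h i j"
cands = space_candidates(text)
even = average_cuts(cands, num_blocks=3)
rand = random_cuts(cands, num_blocks=3, rng=random.Random(0))
```

Passing the same `num_blocks` to every method is how the parallel degree would be kept aligned across baselines in this sketch.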

The second set consists of statistical segmentation methods:

- Loss - Segmentation by the chunked top-k cross-entropy loss values (footnote 2: we use the chunked top-k to prevent cut points from clustering together).
- Entropy - Segmentation by the chunked top-k token entropy values.
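One way to read "chunked top-k" is to split the positions into equal chunks and take the highest-scoring position in each. The sketch below implements that reading; it is an assumption, since the text does not define the procedure, and the scores stand for either per-token loss or per-token entropy.

```python
from typing import List

def chunked_topk_cuts(scores: List[float], num_cuts: int) -> List[int]:
    """Pick one cut per chunk: the position with the highest score in that chunk."""
    n = len(scores)
    num_cuts = min(num_cuts, n)
    if num_cuts <= 0:
        return []
    cuts = []
    for c in range(num_cuts):
        start = c * n // num_cuts
        end = (c + 1) * n // num_cuts
        cuts.append(max(range(start, end), key=lambda i: scores[i]))
    return cuts

def plain_topk_cuts(scores: List[float], num_cuts: int) -> List[int]:
    """Unchunked top-k, shown only for contrast."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return sorted(ranked[:num_cuts])

scores = [0.1, 0.9, 0.8, 0.2, 0.3, 0.7]
chunked = chunked_topk_cuts(scores, num_cuts=2)
plain = plain_topk_cuts(scores, num_cuts=2)
```

On these toy scores the plain top-2 picks adjacent positions 1 and 2, while the chunked variant picks positions 1 and 5, which illustrates the clustering issue the footnote refers to.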

To exclude the impact of segmentation during training, we do not use models trained via block distillation. Instead, we apply these methods to two models: Qwen3-8B, the original chat model, and Tulu3-Block-FT, the block attention model trained via block fine-tuning \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\]. The parallel degree of all segmentation methods is aligned with that of the segmenter.

The performance variance across segmentation baselines is noticeable for Tulu3-Block-FT, demonstrating the impact of segmentation methods. Since Qwen3-8B is not trained on any segmented data, its performance variance is generally lower than that of Tulu3-Block-FT. Although neither model uses any training data partitioned by the segmenter, both perform best on average when the input is processed by the segmenter, compared with the other segmentation baselines (footnote 3: the random candidate and the average candidate generally follow the segmentation methods used by Tulu3-Block-FT). This supports the effectiveness of the segmenter.

#### Generalization failure of block FT

The previous work \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] only tests block attention under the RAG scenario. Hence, how block fine-tuning performs in the general domain remains unverified. We therefore first test the block attention model trained in the previous work \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] to find out whether it can achieve performance similar to the full attention model. The results are shown in Table [2](https://arxiv.org/html/2605.15913#S5.T2). The block-attention performance of Tulu3-Block-FT is degraded compared to "Tulu3-SFT - Full". There are two possible explanations: one is the inadequacy of block fine-tuning, and the other is that its training data is segmented via simple rules. To eliminate the influence of the training data, we further train the "Tulu3-SFT" model using our training data partitioned by our segmenter, and denote the resulting model as "Tulu3-Block-FT-S". The results show that the block-attention performance of this model ("Tulu3-Block-FT-S - Block") still has a noticeable gap compared to the full-attention performance of "Tulu3-SFT". Therefore, we conclude that the source of the performance gap is block fine-tuning itself, and that it is not enough for the generalization of block attention.

#### Effectiveness of block distillation

In this section, we verify the effectiveness of block distillation under both block attention and full attention. The results are shown in Table [3](https://arxiv.org/html/2605.15913#S5.T3) and Table [4](https://arxiv.org/html/2605.15913#S5.T4). Prompt Cache \[Gim et al., [2024](https://arxiv.org/html/2605.15913#bib.bib4)\] and Superposition prompting \[Merth et al., [2024](https://arxiv.org/html/2605.15913#bib.bib5)\] considerably degrade model performance compared to vanilla full attention. This aligns with the findings in \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] and Section [5.2](https://arxiv.org/html/2605.15913#S5.SS2.SSS0.Px1) that the model struggles to adapt to the block attention pattern without training. For all models, both the block-attention and the full-attention performance after block distillation are nearly equivalent to the full-attention performance of the original model. Therefore, block distillation improves the block-attention capability of the model while preserving its full-attention ability.

#### Efficiency

We measure both training and inference efficiency (Table [6](https://arxiv.org/html/2605.15913#S5.T6)). For training, block distillation requires 25,859.9 ms per step, approximately 26% less time than Block-FT (34,941.1 ms), which demonstrates the efficiency of block distillation over block fine-tuning. For inference, we measure the time-to-first-token (TTFT) for vanilla full attention and block attention across sequence lengths from 8k to 64k. Block attention consistently achieves lower TTFT than vanilla full attention, and the absolute gain grows with sequence length: the TTFT reduction increases from 57.9 ms at 8k to 3,149.7 ms at 64k.
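As a quick arithmetic check on the per-step training times quoted above (derived only from those two numbers, not an additional measurement), the relative time saving per step is

$$\frac{34{,}941.1 - 25{,}859.9}{34{,}941.1} = \frac{9{,}081.2}{34{,}941.1} \approx 0.26,$$

which corresponds to the stated figure of approximately 26%. Expressed as throughput instead, the same numbers give $34{,}941.1 / 25{,}859.9 \approx 1.35$, i.e. roughly 35% more training steps per unit of time.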

### 5.3 Ablation study & Analysis

Table 5: The ablation study results.

#### The lost in block head

Consider the following segmented example from the LongBench \[Bai et al., [2024](https://arxiv.org/html/2605.15913#bib.bib23)\] synthetic task:

> "Block 1 \- Paragraph 1:…; Block 2 \- Paragraph 2:…; Block 3 \- Paragraph 3:…;…; Block n \- The following is an abstract:…, Please enter the number of the paragraph that the abstract is from\. The answer format must be like "Paragraph 1", "Paragraph 2", etc\. The answer is: "

where the model is required to retrieve information from the head (beginning) of a block. We find that the block attention model has serious trouble dealing with this type of query. We name this phenomenon *lost in block head*. We believe that the source of the problem is linked to the findings of a previous investigation \[Zhang et al., [2025](https://arxiv.org/html/2605.15913#bib.bib11)\], which shows that the L2-norm of the key states is extremely small at the head of the blocks. Interestingly, despite using far more training data than this work, the block-FT model trained in \[Ma et al., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] shows a considerable collapse in this synthetic task (Tulu3-Block-FT - Block in Table [3](https://arxiv.org/html/2605.15913#S5.T3)), indicating that simple fine-tuning may not be enough.

Table 6: The efficiency measurement\.

#### Block sink tokens

To tackle the lost in block head problem, we introduce a new special token "<\|block\_start\|\>", which is prepended to each block to increase the L2\-norm of the key states\. We verify its effectiveness with an ablation in Table [5](https://arxiv.org/html/2605.15913#S5.T5)\. Without the block sink token, performance decreases considerably on the synthetic task, which aligns with the discussion in section [5\.3](https://arxiv.org/html/2605.15913#S5.SS3.SSS0.Px1)\. In addition, the few\-shot and single\-doc QA tasks also drop noticeably, implying that the block sink token helps the model understand not only the information at the block head but also the block content that follows\.
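As an illustration of where the sink token sits, the sketch below prepends it to every block of an already segmented input. It works on plain lists of token strings rather than a real tokenizer; the function name `add_block_sink_tokens` is hypothetical, and the token spelling is our reading of the name given in the text. In an actual model the token would presumably also need to be registered in the vocabulary and trained, which is not shown here.

```python
# Token spelling reconstructed from the text; treat it as an assumption.
BLOCK_START = "<|block_start|>"

def add_block_sink_tokens(blocks, sink_token=BLOCK_START):
    """Return new blocks with the sink token placed before each block's first token."""
    return [[sink_token] + list(block) for block in blocks]

# Two toy blocks, each a list of token strings.
blocks = [["Paragraph", "1", ":"], ["Paragraph", "2", ":"]]
with_sinks = add_block_sink_tokens(blocks)

for block in with_sinks:
    print(block)
```

The input blocks are left unmodified, so the same segmentation can be reused with or without sink tokens when comparing the two settings.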

#### Block dropout

We verify the effectiveness of block dropout and the KL divergence loss on the LongBench benchmark \[Bai et al\., [2024](https://arxiv.org/html/2605.15913#bib.bib23)\] \(Table [5](https://arxiv.org/html/2605.15913#S5.T5)\)\. To ablate block dropout, we force the KL loss to be computed for the last block only\. Performance on the few\-shot and synthetic tasks decreases substantially when block dropout is absent during training, suggesting that it helps alleviate the lost in block head problem\. We ablate the KL loss by removing it entirely\. Performance then drops sharply on the single\-doc QA, few\-shot, and synthetic tasks, suggesting that the KL loss plays an important role in learning block\-attention patterns\.
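To make the "last block only" ablation concrete, the sketch below contrasts a token-level KL loss averaged over all blocks with the same loss restricted to the final block. It is a toy illustration under our own assumptions: the direction KL(teacher || student), the plain average over selected tokens, and the names `kl_divergence` and `masked_kl_loss` are not taken from the paper, and the block dropout mechanism itself is not modeled.

```python
import math

def kl_divergence(teacher_probs, student_probs, eps=1e-12):
    """KL(teacher || student) for one token's distribution over the vocabulary."""
    return sum(
        t * (math.log(t + eps) - math.log(s + eps))
        for t, s in zip(teacher_probs, student_probs)
        if t > 0.0
    )

def masked_kl_loss(teacher, student, block_ids, keep_blocks):
    """Average per-token KL over tokens whose block id is in keep_blocks."""
    selected = [
        kl_divergence(t, s)
        for t, s, b in zip(teacher, student, block_ids)
        if b in keep_blocks
    ]
    return sum(selected) / len(selected) if selected else 0.0

# Toy data: 4 tokens, vocabulary of size 2, two blocks with ids 0 and 1.
teacher = [[0.9, 0.1], [0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
student = [[0.6, 0.4], [0.8, 0.2], [0.5, 0.5], [0.3, 0.7]]
block_ids = [0, 0, 1, 1]

all_blocks_loss = masked_kl_loss(teacher, student, block_ids, {0, 1})
last_block_loss = masked_kl_loss(teacher, student, block_ids, {1})
print(all_blocks_loss, last_block_loss)
```

In this toy case the student only disagrees with the teacher inside the first block, so supervising the last block alone yields no training signal for that mismatch.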

#### Token weights

We verify the effectiveness of the token weights by replacing them with the usual mean reduction for the cross\-entropy loss\. Although the unweighted cross\-entropy improves Multi\-doc QA performance, it degrades performance on the single\-doc QA and synthetic tasks\.
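The difference between the two reductions can be seen on a toy example. The sketch below assumes per-token cross-entropy values are already available and that the weighted loss is normalized by the sum of the weights; both the normalization and the example weights are our assumptions, since this section does not give the weighting formula.

```python
def mean_cross_entropy(token_losses):
    """Usual mean reduction: every token contributes equally."""
    return sum(token_losses) / len(token_losses)

def weighted_cross_entropy(token_losses, token_weights):
    """Weighted reduction, normalized by the total weight (an assumption)."""
    total_weight = sum(token_weights)
    weighted_sum = sum(w * l for w, l in zip(token_weights, token_losses))
    return weighted_sum / total_weight

# Hypothetical per-token losses; the last token is treated as block-attention-sensitive.
token_losses = [0.2, 0.2, 2.0]
token_weights = [1.0, 1.0, 4.0]

unweighted = mean_cross_entropy(token_losses)
weighted = weighted_cross_entropy(token_losses, token_weights)
print(unweighted, weighted)
```

With the mean reduction the high-loss token is diluted by the easy ones, whereas the weighted version lets it dominate the objective.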

## 6 Related work

Block attention \[Ma et al\., [2025](https://arxiv.org/html/2605.15913#bib.bib24)\] partitions input sequences into independent blocks to enable efficient KV cache reuse and parallel prefilling, but its broader adoption is hindered by costly block fine\-tuning and limited generalization beyond RAG settings\. Prompt Cache \[Gim et al\., [2024](https://arxiv.org/html/2605.15913#bib.bib4)\] explores training\-free modular attention reuse across prompts, while Superposition prompting \[Merth et al\., [2024](https://arxiv.org/html/2605.15913#bib.bib5)\] processes documents in parallel paths and prunes irrelevant ones\. However, both approaches suffer from severe performance degradation when directly applied under block\-attention patterns without dedicated training\. In contrast, our work introduces a learned semantic segmenter and an efficient block distillation framework that overcome these limitations, achieving block\-attention performance close to the full\-attention upper bound while preserving full\-attention capability\.

## 7 Conclusion

In this work, we address two fundamental obstacles that prevent the broader adoption of block attention: the absence of a principled segmentation method and the inefficiency of existing block fine\-tuning\. To tackle the first, we construct SemanticSeg, a large\-scale multi\-domain segmentation dataset, and train a lightweight neural segmenter that partitions text into semantically coherent blocks with controllable granularity\. To overcome the second, we propose Block Distillation, an efficient training framework that incorporates three novel components—block sink tokens, block dropout, and token\-level loss weighting—to effectively transfer full\-attention capability to the block\-attention pattern\. Extensive experiments on LongBench and LoCoMo across multiple model families demonstrate the effectiveness of the segmenter and Block Distillation\.

## References

- Y\. Bai, X\. Lv, J\. Zhang, H\. Lyu, J\. Tang, Z\. Huang, Z\. Du, X\. Liu, A\. Zeng, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2024\) LongBench: A bilingual, multitask benchmark for long context understanding\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), pp\. 3119–3137\. External Links: [Link](https://doi.org/10.18653/v1/2024.acl-long.172), [Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.172)\.
- Y\. Bai, S\. Tu, J\. Zhang, H\. Peng, X\. Wang, X\. Lv, S\. Cao, J\. Xu, L\. Hou, Y\. Dong, J\. Tang, and J\. Li \(2025\) LongBench v2: towards deeper understanding and reasoning on realistic long\-context multitasks\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), pp\. 3639–3664\. External Links: [Link](https://aclanthology.org/2025.acl-long.183/)\.
- Y\. Chen, S\. Qian, H\. Tang, X\. Lai, Z\. Liu, S\. Han, and J\. Jia \(2024\) LongLoRA: efficient fine\-tuning of long\-context large language models\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. External Links: [Link](https://openreview.net/forum?id=6PmJoRfdaK)\.
- A\. Chevalier, J\. Geng, A\. Wettig, H\. Chen, S\. Mizera, T\. Annala, M\. J\. Aragon, A\. R\. Fanlo, S\. Frieder, S\. Machado, A\. Prabhakar, E\. Thieu, J\. T\. Wang, Z\. Wang, X\. Wu, M\. Xia, W\. Xia, J\. Yu, J\. Zhu, Z\. J\. Ren, S\. Arora, and D\. Chen \(2024\) Language models as science tutors\. In Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024, R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, pp\. 8310–8335\. External Links: [Link](https://proceedings.mlr.press/v235/chevalier24a.html)\.
- J\. Dong, B\. Feng, D\. Guessous, Y\. Liang, and H\. He \(2024\) Flex attention: A programming model for generating optimized attention kernels\. CoRR abs/2412\.05496\. External Links: [Link](https://doi.org/10.48550/arXiv.2412.05496), [Document](https://dx.doi.org/10.48550/ARXIV.2412.05496), 2412\.05496\.
- I\. Gim, G\. Chen, S\. Lee, N\. Sarda, A\. Khandelwal, and L\. Zhong \(2024\) Prompt cache: modular attention reuse for low\-latency inference\. In Proceedings of the Seventh Annual Conference on Machine Learning and Systems, MLSys 2024, Santa Clara, CA, USA, May 13\-16, 2024, P\. B\. Gibbons, G\. Pekhimenko, and C\. D\. Sa \(Eds\.\)\. External Links: [Link](https://proceedings.mlsys.org/paper_files/paper/2024/hash/a66caa1703fe34705a4368c3014c1966-Abstract-Conference.html)\.
- P\. Hsu, Y\. Dai, V\. Kothapalli, Q\. Song, S\. Tang, S\. Zhu, S\. Shimizu, S\. Sahni, H\. Ning, and Y\. Chen \(2024\) Liger kernel: efficient triton kernels for LLM training\. CoRR abs/2410\.10989\. External Links: [Link](https://doi.org/10.48550/arXiv.2410.10989), [Document](https://dx.doi.org/10.48550/ARXIV.2410.10989), 2410\.10989\.
- D\. Kocetkov, R\. Li, L\. B\. Allal, J\. Li, C\. Mou, Y\. Jernite, M\. Mitchell, C\. M\. Ferrandis, S\. Hughes, T\. Wolf, D\. Bahdanau, L\. von Werra, and H\. de Vries \(2023\) The stack: 3 TB of permissively licensed source code\. Trans\. Mach\. Learn\. Res\. 2023\. External Links: [Link](https://openreview.net/forum?id=pxpbTdUEpD)\.
- W\. Kryscinski, N\. Rajani, D\. Agarwal, C\. Xiong, and D\. Radev \(2022\) BOOKSUM: A collection of datasets for long\-form narrative summarization\. In Findings of the Association for Computational Linguistics: EMNLP 2022, Abu Dhabi, United Arab Emirates, December 7\-11, 2022, Y\. Goldberg, Z\. Kozareva, and Y\. Zhang \(Eds\.\), Findings of ACL, pp\. 6536–6558\. External Links: [Link](https://doi.org/10.18653/v1/2022.findings-emnlp.488), [Document](https://dx.doi.org/10.18653/V1/2022.FINDINGS-EMNLP.488)\.
- S\. Li, Z\. Zhang, Y\. Deng, C\. Deng, T\. Fang, H\. Zhang, H\. Mi, D\. Yu, and W\. Lam \(2025\) InComeS: integrating compression and selection mechanisms into llms for efficient model editing\. CoRR abs/2505\.22156\. External Links: [Link](https://doi.org/10.48550/arXiv.2505.22156), [Document](https://dx.doi.org/10.48550/ARXIV.2505.22156), 2505\.22156\.
- A\. Lozhkov, L\. Ben Allal, L\. von Werra, and T\. Wolf \(2024\) FineWeb\-edu: the finest collection of educational content\. Hugging Face\. External Links: [Link](https://huggingface.co/datasets/HuggingFaceFW/fineweb-edu), [Document](https://dx.doi.org/10.57967/hf/2497)\.
- D\. Ma, Y\. Wang, and T\. Lan \(2025\) Block\-attention for efficient prefilling\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. External Links: [Link](https://openreview.net/forum?id=7zNYY1E2fq)\.
- A\. Maharana, D\. Lee, S\. Tulyakov, M\. Bansal, F\. Barbieri, and Y\. Fang \(2024\) Evaluating very long\-term conversational memory of LLM agents\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024, L\. Ku, A\. Martins, and V\. Srikumar \(Eds\.\), pp\. 13851–13870\. External Links: [Link](https://doi.org/10.18653/v1/2024.acl-long.747), [Document](https://dx.doi.org/10.18653/V1/2024.ACL-LONG.747)\.
- T\. Merth, Q\. Fu, M\. Rastegari, and M\. Najibi \(2024\) Superposition prompting: improving and accelerating retrieval\-augmented generation\. In Forty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024, R\. Salakhutdinov, Z\. Kolter, K\. A\. Heller, A\. Weller, N\. Oliver, J\. Scarlett, and F\. Berkenkamp \(Eds\.\), Proceedings of Machine Learning Research, pp\. 35507–35527\. External Links: [Link](https://proceedings.mlr.press/v235/merth24a.html)\.
- K\. Paster, M\. D\. Santos, Z\. Azerbayev, and J\. Ba \(2024\) OpenWebMath: an open dataset of high\-quality mathematical web text\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. External Links: [Link](https://openreview.net/forum?id=jKHmjlpViu)\.
- Z\. Shen, T\. Tao, L\. Ma, W\. Neiswanger, Z\. Liu, H\. Wang, B\. Tan, J\. Hestness, N\. Vassilieva, D\. Soboleva, and E\. P\. Xing \(2023\) SlimPajama\-dc: understanding data combinations for LLM training\. CoRR abs/2309\.10818\. External Links: [Link](https://doi.org/10.48550/arXiv.2309.10818), [Document](https://dx.doi.org/10.48550/ARXIV.2309.10818), 2309\.10818\.
- H\. Trivedi, N\. Balasubramanian, T\. Khot, and A\. Sabharwal \(2022\) ♫ MuSiQue: multihop questions via single\-hop question composition\. Trans\. Assoc\. Comput\. Linguistics 10, pp\. 539–554\. External Links: [Link](https://doi.org/10.1162/tacl_a_00475), [Document](https://dx.doi.org/10.1162/TACL%5FA%5F00475)\.
- D\. Wu, H\. Wang, W\. Yu, Y\. Zhang, K\. Chang, and D\. Yu \(2025\) LongMemEval: benchmarking chat assistants on long\-term interactive memory\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. External Links: [Link](https://openreview.net/forum?id=pZiyCaVuti)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\) Efficient streaming language models with attention sinks\. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024\. External Links: [Link](https://openreview.net/forum?id=NG7sS51zVF)\.
- P\. Xu, W\. Ping, X\. Wu, C\. Xu, Z\. Liu, M\. Shoeybi, and B\. Catanzaro \(2025\) ChatQA 2: bridging the gap to proprietary llms in long context and RAG capabilities\. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025\. External Links: [Link](https://openreview.net/forum?id=cPD2hU35x3)\.
- Z\. Yang, P\. Qi, S\. Zhang, Y\. Bengio, W\. W\. Cohen, R\. Salakhutdinov, and C\. D\. Manning \(2018\) HotpotQA: A dataset for diverse, explainable multi\-hop question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018, E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\), pp\. 2369–2380\. External Links: [Link](https://doi.org/10.18653/v1/d18-1259), [Document](https://dx.doi.org/10.18653/V1/D18-1259)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2023\) ReAct: synergizing reasoning and acting in language models\. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023\. External Links: [Link](https://openreview.net/forum?id=WE_vluYUL-X)\.
- Z\. Zhang, Y\. Wang, X\. Huang, T\. Fang, H\. Zhang, C\. Deng, S\. Li, and D\. Yu \(2025\) Attention entropy is a key factor: an analysis of parallel context encoding with full\-attention\-based pre\-trained language models\. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025, W\. Che, J\. Nabende, E\. Shutova, and M\. T\. Pilehvar \(Eds\.\), pp\. 9840–9855\. External Links: [Link](https://aclanthology.org/2025.acl-long.485/)\.

## Appendix A Limitations

There are several limitations of this work that we need to discuss\.

#### Block dropout in pretraining

So far, we have only explored block dropout in the post\-training phase\. However, we believe it can be scaled to the pre\-training phase to further improve capability\. In that setting, the cross\-entropy loss could also be applied to the non\-corrupted block\.

#### Thinking mode verification

We do not verify the thinking mode of the block attention model because the training data does not contain any thinking content\. However, based on previous work \[Zhang et al\., [2025](https://arxiv.org/html/2605.15913#bib.bib11)\], the thinking mode has high potential to be effective under block attention\.

#### Model type & size

Due to resource limitations, we scale the model size only up to 14B\. The effectiveness of block distillation for larger models requires further exploration\. Additionally, the experiments are conducted only on dense models; the compatibility of block attention with other popular architectures, such as MoE, requires further study\.

#### RL compatibility

Reinforcement learning is a powerful tool for improving a model’s reasoning and agentic capabilities\. However, no existing work investigates the compatibility of block attention with RL, so we consider this a promising direction for further exploration\. If the two prove compatible, the cost of commercial models could be considerably reduced\.

#### Agentic application

This paper does not verify the effectiveness of block attention in agentic scenarios\. However, a major advantage of block attention is KV cache reuse across prompts\. If it can be applied to agents, it could avoid much wasted re\-encoding and substantially decrease cost\.

## Appendix B The segmentation dataset

Table 7: The statistics for the SemanticSeg dataset\. "Comprehensive" means all the existing code categories in The Stack \[Kocetkov et al\., [2023](https://arxiv.org/html/2605.15913#bib.bib20)\]\. The cut rate is calculated as the number of authenticated cut tokens ÷ the number of candidate cut tokens\.

The specific statistics and the resources for the SemanticSeg dataset are shown in Table [7](https://arxiv.org/html/2605.15913#A2.T7)\. The prompt used for generating the segmentation data is as follows:

Prompt used by Gemini\-2\.5\-Pro for SemanticSeg

> You are a master editor specializing in text segmentation for parallel processing by Large Language Models\. Your goal is to segment a given text into the largest possible, yet fully self\-contained, semantic chunks\. The text is pre\-segmented with candidate markers `<cut 0>` to `<cut N>`\.
>
> Your task is to identify the correct boundaries by selecting a subset of these markers\. A chunk is "self\-contained" if a person \(or an LLM\) can understand it completely in isolation, without needing to read the preceding chunk\.
>
> \*\*The Core Principle:\*\* The ideal cutting point is where the topic COMPLETELY changes, and the new chunk can be understood without any ambiguity\. If a chunk starts with a sentence that makes you ask "wait, who/what/why are they talking about?", then the cut is wrong and the chunks must be merged\.
>
> \*\*CRITICAL HEURISTICS FOR COHERENCE \(Rules for what NOT to do\):\*\* You must merge chunks avoid these common errors\. A cut is BAD if:
>
> 1\. \*\*It splits a sentence\.\*\* The text after a "<cut \*\>" marker cannot be the grammatical continuation of a sentence from before the marker\. Example: "\.\.\.he demanded", "Do the Maquas dare\.\.\." \-\> BAD CUT\. Merge the two chunks to form a complete sentence\.
>
> 2\. \*\*It breaks direct anaphora \(reference\)\.\*\* The new chunk cannot start with words or phrases that directly refer to the immediate preceding text\. Examples: "I’ll try those suggestions\.", "Because of this\.\.\.", "That’s a great point\.", "He/she then\.\.\.", "As they\.\.\." \-\> BAD CUT\. The reference \("those suggestions", "this", "That", "He/She", "They", etc\.\) is in the previous chunk\. Merge the reference chunk with its immediately previous chunk\.
>
> 3\. \*\*It separates a question from its answer, or a statement from its immediate response\.\*\* Dialogue is a continuous flow\. A multi\-turn exchange on a single, evolving topic should be in ONE chunk\. Example: A user asks for coffee shops, the assistant lists them, the user asks about Wi\-Fi at one of them\. This entire thread about finding a coffee shop is ONE semantic unit and should be in the SAME chunk\.
>
> 4\. \*\*It separates functionally dependent content\.\*\* Text that references an element \(like a figure or table\) must be in the same chunk as that element’s description\. Example: Text discussing "Figure 1" must be in the same chunk as the caption for "Figure 1"\.
>
> 5\. \*\*It separates the chunk content from its header\.\*\* The chunk content and its corresponding header should always stay in the same chunk\. Example: "Chapter V\\n", "Mary did something bad\.\.\.\\n" \-\> BAD CUT\. The header \("Chapter", "Section", "Part", etc\.\) and its content should be in the SAME chunk\. Merge the header chunk with its content chunk\.
>
> 6\. \*\*It breaks the semantic clarity\.\*\* Each chunk should remain understandable to a downstream model or retrieval system\. Examples: For Python, "def hello\_word\(a:str\)\\n", " print\("hello world\!"\)" \-\> BAD CUT\. The function head and its content should not be separated; "class myclass\(\)\\n", " def \_\_init\_\_\(self, config\):" \-\> BAD CUT\. The class head and its content should not be separated\. Merge them\.
>
> \*\*Cutting steps you should refer:\*\*
>
> 1\. Read the entire text to get a general sense of its structure \(narrative, dialogue, academic paper, code, etc\.\)\.
>
> 2\. Iterate through each candidate marker from `<cut 1>` to `<cut N-1>`\.
>
> 3\. For each marker, examine the two chunks immediately before and after it\. Ask yourself whether this marker violates the heuristics above and whether it fulfills the core principle above\.
>
> 4\. If the marker violates any of the above heuristics or does not fulfill the above core principle, identify this as a BAD cut point, move to the next marker and repeat step 3\. Otherwise, identify this marker as a GOOD cut point\.
>
> 5\. Based on the identified GOOD cut points, construct your final output\.
>
> \*\*NOTE:\*\* \(1\)\. Usually, the final number of chunks should be greater than or equal to 5\. \(2\)\. For the length of each chunk, it should contain no more than 350 candidate markers\. \(3\)\. Try your best to fulfill both \(1\) and \(2\)\. If you can’t, you can compromise between \(1\) and \(2\) depend on the circumstances\. \(4\)\. Extremely large chunk is not allowed\! Putting extremely large content into a chunk \(For example, putting a whole paper into a chunk\) to avoid BAD cuts is unacceptable\!
>
> \*\*Output Format:\*\* Follow these rules exactly:
>
> 1\. Chunk boundaries must sit only on the exact strings `<cut *>`\.
>
> 2\. Output \*\*only\*\* lines of the form and \*\*nothing else\*\*: `chunk <number>: <cut *> --- <cut **>` where `<cut *>` and `<cut **>` are the beginning and ending cutting point markers of the current chunk\.
>
> 3\. Number chunks sequentially starting with 1\.
>
> 4\. The beginning marker of the first chunk must be `<cut 0>` and the final marker of the final chunk must be `<cut N>`\.
>
> 5\. Do not output any duplicated chunks\.
>
> Text to segment:
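The output format requested by the prompt is regular enough to be checked mechanically. The sketch below parses such an output and enforces only the rules the prompt states explicitly (line format, sequential numbering, first and last markers, no duplicates). The names `parse_chunks` and `n_markers` are hypothetical, and this is not the authors' tooling; in the example data, each chunk ending on the marker where the next begins is our assumption about how the ranges are meant to tile the text.

```python
import re

# One output line: "chunk <number>: <cut A> --- <cut B>"
LINE_RE = re.compile(r"^chunk (\d+): <cut (\d+)> --- <cut (\d+)>$")

def parse_chunks(output, n_markers):
    """Parse the model output into (start, end) marker pairs and check the stated rules."""
    chunks = []
    for line in output.strip().splitlines():
        match = LINE_RE.match(line.strip())
        if match is None:
            raise ValueError(f"unexpected line: {line!r}")
        number, start, end = map(int, match.groups())
        if number != len(chunks) + 1:
            raise ValueError("chunks must be numbered sequentially from 1")
        chunks.append((start, end))
    if not chunks:
        raise ValueError("no chunks found")
    if chunks[0][0] != 0 or chunks[-1][1] != n_markers:
        raise ValueError("chunks must span <cut 0> to <cut N>")
    if len(set(chunks)) != len(chunks):
        raise ValueError("duplicated chunks")
    return chunks

example = """chunk 1: <cut 0> --- <cut 3>
chunk 2: <cut 3> --- <cut 7>
chunk 3: <cut 7> --- <cut 10>"""

parsed = parse_chunks(example, n_markers=10)
print(parsed)
```

A check of this kind would let malformed generations be rejected or regenerated before they enter a dataset.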

## Appendix C Training details

### C\.1 Segmenter

We use all categories of data to train the segmenter\. Note that for the code category, we only use the comprehensive subset\. The segmenter is an autoregressive model backbone with a new cut head, which consists of two linear layers with an intermediate ReLU activation\. We use a learning rate of 2e\-5 to 2e\-6 with a cosine decay strategy\.
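The cut head described above can be sketched in a few lines. The version below is plain Python for readability and makes several assumptions that the text does not specify: the hidden and intermediate sizes, a single output logit per candidate position, a sigmoid to turn that logit into a cut probability, and random weights in place of trained ones. The backbone is not modeled; `hidden_state` stands in for its output at one candidate cut position.

```python
import math
import random

def linear(x, weight, bias):
    """y = W x + b for a single vector x, with weight given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def cut_head(hidden_state, params):
    """Linear -> ReLU -> Linear, followed by a sigmoid (the sigmoid is an assumption)."""
    inner = relu(linear(hidden_state, params["w1"], params["b1"]))
    (logit,) = linear(inner, params["w2"], params["b2"])
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical sizes and randomly initialized weights, for illustration only.
rng = random.Random(0)
hidden_size, inner_size = 8, 4
params = {
    "w1": [[rng.uniform(-0.1, 0.1) for _ in range(hidden_size)] for _ in range(inner_size)],
    "b1": [0.0] * inner_size,
    "w2": [[rng.uniform(-0.1, 0.1) for _ in range(inner_size)]],
    "b2": [0.0],
}

hidden_state = [rng.uniform(-1.0, 1.0) for _ in range(hidden_size)]
p_cut = cut_head(hidden_state, params)
print(f"cut probability: {p_cut:.3f}")
```

Thresholding a score of this kind is one plausible way to obtain controllable granularity, though the mechanism actually used is described elsewhere in the paper.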

### C\.2 Block distillation

We use the flex attention \[Dong et al\., [2024](https://arxiv.org/html/2605.15913#bib.bib3)\] framework and liger\-kernel \[Hsu et al\., [2024](https://arxiv.org/html/2605.15913#bib.bib2)\] to implement the block attention during training\. For all models, we adopt a learning rate of 2e\-6 to 2e\-7 and a cosine decay strategy\.
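A cosine decay schedule over the stated learning-rate range can be written as follows. This sketch reads the two values as the start and end of the schedule and assumes no warmup; both are our interpretation, since the text gives only the range and the decay type. The function name `cosine_lr` is hypothetical.

```python
import math

def cosine_lr(step, total_steps, lr_max=2e-6, lr_min=2e-7):
    """Cosine decay from lr_max at step 0 to lr_min at total_steps (no warmup assumed)."""
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * progress))

total_steps = 100
for step in (0, 25, 50, 75, 100):
    print(step, f"{cosine_lr(step, total_steps):.3e}")
```

The same function covers the segmenter setting by passing `lr_max=2e-5` and `lr_min=2e-6`.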

For the training dataset, we adopt HotpotQA \[Yang et al\., [2018](https://arxiv.org/html/2605.15913#bib.bib8)\] and a subset of ChatQA2 \[Xu et al\., [2025](https://arxiv.org/html/2605.15913#bib.bib7)\]\. We use the trained segmenter to segment the samples from the ChatQA2 subset \(called ChatQA2Seg\)\. Overall, the training set contains around 180k samples, each with a length of less than 32k\.

## Appendix D Application scenario analysis

### D\.1 Coding agent

Consider an LLM\-powered coding assistant managing a large repository\. A developer first asks Query A about python1\.py and python2\.py, and later asks Query B about python3\.py and python4\.py\. Both queries share no overlapping files except the project’s global config\.

Under full attention, the KV cache is context\-dependent: it is tied to the exact sequence of documents in a specific prompt\. If the assistant encodes the retrieved files for Query A, the cached KV states are affected by the specific file order and Query A’s tokens\. To serve Query B with a different file subset or order, the cache cannot be partially reused; the entire prompt, including potentially many unchanged files, must be re\-encoded\. This leads to significant and wasteful recomputation\.

Block attention decouples this dependency by treating each document as an independently encoded block\. The KV states of python1\.py, python2\.py, python3\.py, python4\.py, etc\., are stored separately\. When Query B arrives, only the required blocks are fetched and composed with the new query block — no re\-encoding of unchanged documents is needed\. This modular, prompt\-level KV cache reuse is the fundamental efficiency gain of block attention: it replaces rigid, monolithic caching with flexible, composable caching, dramatically reducing redundant prefilling in dynamic, long\-context scenarios typical of coding agents and multi\-document workflows\.
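The caching pattern in this scenario can be sketched with a small content-addressed cache. Everything here is illustrative: `BlockKVCache`, `fake_encode`, and `serve` are hypothetical names, the "KV states" are placeholder strings rather than tensors, and details such as positional handling when blocks are composed are deliberately ignored.

```python
import hashlib

class BlockKVCache:
    """Toy cache mapping a block's content to its (placeholder) KV states."""

    def __init__(self, encode_fn):
        self._encode_fn = encode_fn
        self._store = {}
        self.encode_calls = 0  # counts how many blocks were actually encoded

    def get(self, block_text):
        key = hashlib.sha256(block_text.encode("utf-8")).hexdigest()
        if key not in self._store:
            self._store[key] = self._encode_fn(block_text)
            self.encode_calls += 1
        return self._store[key]

def fake_encode(block_text):
    # Stand-in for a real prefill pass over one block.
    return f"KV<{len(block_text)} chars>"

def serve(cache, repo, file_names, query):
    """Fetch cached file blocks; only the query block is encoded fresh every time."""
    states = [cache.get(repo[name]) for name in file_names]
    states.append(fake_encode(query))
    return states

repo = {
    "config.yaml": "debug: false",
    "python1.py": "print('one')",
    "python2.py": "print('two')",
    "python3.py": "print('three')",
    "python4.py": "print('four')",
}
cache = BlockKVCache(fake_encode)

serve(cache, repo, ["config.yaml", "python1.py", "python2.py"], "Query A")
calls_after_a = cache.encode_calls
serve(cache, repo, ["config.yaml", "python3.py", "python4.py"], "Query B")
calls_after_b = cache.encode_calls
print(calls_after_a, calls_after_b)
```

Query B only triggers encoding for the two files it introduces, because the shared config block is served from the cache.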

### D\.2 Multi\-turn agentic workflows

Consider an LLM\-based research agent tasked with "Investigate recent advances in mechanistic interpretability and summarize key findings\." The agent operates in a multi\-turn ReAct\-style loop \[Yao et al\., [2023](https://arxiv.org/html/2605.15913#bib.bib1)\]: at each step, it decides which tool to use \(search, browse, or code execution\), processes the results, and plans the next action\. The specific turns are as follows:

> Turn 1: Search for "mechanistic interpretability survey 2025" and retrieve Paper A, Paper B, and Paper C\. Turn 2: Browse Paper A and extract the main research landscape\. Turn 3: Browse Paper B and understand the evidence and limitations\. Turn 4: Browse Paper C and record its experimental design and conclusions\. Turn 5: Browse Paper C and record its experimental design and conclusions\. Turn 6: Revisit Paper A to reuse its taxonomy for structuring the final report\. Turn 7: Revisit Paper B to compare its evidence with the results in Paper C\. Turn 8: Extract key sentences from A, B, and C to build a comparison table\. Turn 9: Compose the final report with specific citations to all three papers\.

Under full attention, each new turn concatenates the growing history with newly retrieved documents and the agent’s next action\. Even if papers A, B, and C remain identical across turns, their KV states are embedded in a monolithic context that changes with every search result and reasoning step\. Their cached states from previous turns are rarely reusable because the surrounding prompt context and document ordering have shifted, forcing repeated re‑encoding of the same documents\.

With block attention, each permanent element \(e\.g\., the system prompt, Paper A, Paper B, Paper C, etc\.\) is encoded as an independent block and cached once\. Each agent turn only requires encoding the new query, any new search results, and the fresh reasoning step, while the static document blocks are fetched directly from the cache and combined in a modular way\. Over many turns and many documents, this eliminates repeated prefilling of unchanged content, yielding compounding efficiency savings and substantially lowering latency and compute cost in long\-running agentic workflows\.

## Appendix E Societal impacts

By boosting long\-context efficiency and KV cache reuse, our work lowers compute cost and energy use, making advanced AI more accessible and sustainable for coding assistants, multi\-document analysis, and agentic applications\. However, cheaper inference may lower barriers for misuse \(e\.g\., disinformation\) and accelerate deployment without labor safeguards\. Efficiency benefits may not transfer equally across architectures or languages, risking a widening gap between well\-supported and underrepresented settings\. Responsible deployment and monitoring are encouraged\.
