The human brain is incredibly efficient because it only activates the specific neurons needed for a thought. Modern LLMs naturally try to do this too (> 95% of neurons in feedforward layers stay silent for any given word), but our hardware punishes them for it. One of the most frustrating paradoxes in deep learning: making a model do less math often makes it run slower. Why? Because unstructured sparsity introduces irregular memory access, and GPUs are built for predictable, dense blocks of math. We teamed up with @NVIDIA to try to fix this hardware mismatch. Instead of forcing the GPU to adapt to the sparsity, we built a “Hybrid” format that reshapes the sparsity to fit the GPU. Our sparsity format (TwELL) dynamically routes the 99% of highly sparse tokens through a fast path, and uses a dense backup matrix as a safety valve for the rare, heavy tokens. Through TwELL and a new set of custom CUDA kernels for both LLM inference and training, we translated theoretical sparsity into actual wall-clock speedups: >20% faster training and inference on H100 GPUs, while also cutting energy consumption and memory requirements. Paper: https://arxiv.org/abs/2603.23198 Blog: https://pub.sakana.ai/sparser-faster-llms/… Code: https://github.com/SakanaAI/sparser-faster-llms…
Sparser, Faster, Lighter Transformer Language Models
Source: https://arxiv.org/html/2603.23198
Corresponding authors: Edoardo Cetin ([email protected]), Emilio Castillo ([email protected])
Stefano Peluchetti* (Sakana AI), Emilio Castillo* (NVIDIA), Akira Naruse (NVIDIA), Mana Murakami (NVIDIA), Llion Jones (Sakana AI). *Core contributors.
Abstract
Scaling autoregressive large language models (LLMs) has driven unprecedented progress but comes with vast computational costs. In this work, we tackle these costs by leveraging unstructured sparsity within an LLM’s feedforward layers, the components accounting for most of the model parameters and execution FLOPs. To achieve this, we introduce a new sparse packing format and a set of CUDA kernels designed to seamlessly integrate with the optimized execution pipelines of modern GPUs, enabling efficient sparse computation during LLM inference and training. To substantiate our gains, we provide a quantitative study of LLM sparsity, demonstrating that simple L1 regularization can induce over 99% sparsity with negligible impact on downstream performance. When paired with our kernels, we show that these sparsity levels translate into substantial throughput, energy efficiency, and memory usage benefits that increase with model scale. We will release all code and kernels under an open-source license to promote adoption and accelerate research toward establishing sparsity as a practical axis for improving the efficiency and scalability of modern foundation models.
1 Introduction
Figure 1: Comparison of ELL with our new TwELL and Hybrid sparse formats designed for LLM inference and training, respectively.

Large Language Models (LLMs) have revolutionized natural language processing, demonstrating unprecedented capabilities in text generation, reasoning, and knowledge retrieval (openai2023gpt4; gemini). The core component driving these advancements has been massive computational investment into scaling the seminal Transformer architecture, with current LLMs reaching hundreds of billions of parameters (vaswani2017attention; gpt2; gpt3). However, with increasingly larger models requiring vast computational resources for both inference and training, there is a growing need for fundamental efficiency improvements to ensure the present and future sustainability of the field (schwartz2020green; luccioni2023bloom).
One seminal avenue for improving the efficiency of machine learning models is sparsity (lecun1989optimal; han2015learning; hoefler2021sparsity). For modern overparameterized LLMs, recent investigations have even documented that sparsity arises naturally in their feed-forward layers, with only a small fraction of hidden neurons activated for any given token (zhang-etal-2022-moefication; li2023lazyneuronphenomenonemergence). Thus, with feed-forward computation accounting for over two-thirds of the parameters and over 80% of the total FLOPs in larger models (llms-flops-params), sparsity seemingly offers a natural opportunity for concrete computational savings.
However, a frustrating paradox has blocked progress: despite performing far less theoretical computation, kernels implementing sparse operations often run slower than dense operations on modern GPUs. The culprit is a fundamental mismatch between unstructured sparsity and GPU architectures, whose hardware and software stacks have been heavily optimized for dense computation patterns (BLAS-library; nvidia-cublas; nvidia-cublasdx; nvidia-cutlass). Heterogeneous workloads, together with the overheads of materializing and managing sparse indices, have been the critical challenges preventing generalized computational savings. Due to these challenges, previous attempts to realize efficiency gains have relied on considerable deviations from modern training recipes and have yet to see practical adoption (dejavu_contectual_sparsity_related; q_sparse_top_k_related).
In this work, we introduce new kernels designed for modern NVIDIA GPUs to bridge this gap and leverage unstructured sparsity to deliver substantial speedups while reducing memory requirements and energy consumption during both LLM inference and training. Our kernels build on Tile-wise ELLPACK (TwELL), a new packing format for sparse data that can be naturally materialized in the epilogue of highly-optimized matrix multiplication kernels, removing a canonical bottleneck of prior packing schemes. Starting from TwELL, our inference kernels fuse multiple matrix-multiplications into a single optimized pipeline that minimizes computation, while our training kernels further reduce the sparse representation to a hybrid format that trivializes storage costs of intermediate activations.
To substantiate our gains, we provide a quantitative study of LLM sparsity across model scales, demonstrating that mild levels of L1 regularization can achieve over 99% sparsity with negligible impact on downstream performance. Through our new kernels, we show these sparsity levels translate into benefits that grow with larger parameter counts in terms of processing throughput, energy savings, and memory requirements, delivering up to 20.5% and 21.9% speedups in forward execution and training, respectively, for models with billions of parameters. We analyze how these benefits specifically stem from the computational unevenness across network layers and natural language data, which sparse models can inherently exploit. By providing a clear demonstration of its practical benefits, we hope this work will help establish sparsity as a new axis for improving the scalability and performance of modern foundation models.
In summary, our main contributions are threefold:
1. We introduce and share new CUDA kernels for inference and training, with several key innovations to make sparse LLMs cheaper, faster, and lighter on modern GPUs.
2. We provide a quantitative analysis showing that high levels of unstructured sparsity can be achieved using mild L1 regularization with negligible compromises on performance.
3. We demonstrate and analyze how our kernels leverage such sparsity with substantial and increasing benefits at larger scales for LLMs with billions of parameters.
2 Large Language Models, Feedforward Blocks, and Sparsity
While the original Transformer used a simple 2-layer feedforward block, this module has seen considerable evolution since its conception (vaswani2017attention). The most recent architectures have largely converged to a 3-layer gated design that has consistently proven empirically superior when evaluated at large scale (shazeer2020glu). While in this work we release kernels for both the original and gated blocks, we focus our main text on the newer design and defer to Appendix C for further discussion, results, and comparisons with the older variant.
2.1 Feed-forward Modules as Sparse Knowledge Stores
A modern gated feed-forward block (shazeer2020glu) is parameterized by three weight matrices $W_g \in \mathbb{R}^{K\times N}$, $W_u \in \mathbb{R}^{K\times N}$, and $W_d \in \mathbb{R}^{N\times K}$, representing the gate, up, and down projection matrices, respectively. In our notation, we use $M$ to denote the feed-forward block's effective batch size over all batched sequences and positions, $K$ to denote its input/output dimension, and $N$ to denote its hidden expanded dimension. The gate and up projection matrices both process the block's input batch $x \in \mathbb{R}^{M\times K}$ and produce the gate and up activations $h_g$ and $h_u \in \mathbb{R}^{M\times N}$, where the symmetry between the two is broken by a non-linear activation function $\sigma$. These projections are then combined with elementwise multiplication into a unified hidden representation $h \in \mathbb{R}^{M\times N}$, before being projected back to the original dimensionality using the down projection weights $W_d$ to compute the block's outputs $y \in \mathbb{R}^{M\times K}$:
$$h_u = xW_u,\quad h_g = \sigma(xW_g),\quad h = h_u \odot h_g,\quad y = hW_d.\qquad(1)$$

Since the hidden dimension $N$ is typically much larger than $K$, feed-forward blocks can often account for most of the model's parameters and FLOPs. We note that a common conceptualization of these architectural components is that of a dynamic key-value memory (geva-etal-2021-transformer; dai-etal-2022-knowledge). In this mental model, the inner products between $x$ and the columns of $W_g$ and $W_u$ induce keys $h$, while the rows of $W_d$ are seen as values acting as memory slots that can be dynamically retrieved based on the input.
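As a concrete reference for Eq. (1), the block's forward pass can be sketched in a few lines of NumPy. This is a minimal dense sketch with our own names (`gated_ffn` and the toy shapes are illustrative, not from the paper), using the ReLU gate adopted in Section 2.2:

```python
import numpy as np

def gated_ffn(x, W_g, W_u, W_d):
    """Dense reference of Eq. (1) with a ReLU gate."""
    h_u = x @ W_u                  # up projection, shape (M, N)
    h_g = np.maximum(x @ W_g, 0)   # gate projection + ReLU, shape (M, N)
    h = h_u * h_g                  # elementwise combination
    return h @ W_d                 # down projection back to (M, K)

# Toy shapes: M=4 tokens, model dim K=8, expanded hidden dim N=32.
rng = np.random.default_rng(0)
M, K, N = 4, 8, 32
x = rng.standard_normal((M, K))
W_g, W_u = rng.standard_normal((K, N)), rng.standard_normal((K, N))
W_d = rng.standard_normal((N, K))
y = gated_ffn(x, W_g, W_u, W_d)
assert y.shape == (M, K)
```

Note how the ReLU gate zeroes entire columns of $h$, which is exactly the unstructured sparsity the kernels in Section 3 exploit.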
2.2 Simple Ingredients for Training Sparse LLMs
We employ a simple recipe to induce varying levels of sparsity in the feed-forward activations, making minimal deviations from established architectures and training objectives. First, we use the ReLU as the activation function of choice following the gate projections. Second, we add a simple L1 loss to the standard cross-entropy with a tunable coefficient $L_1$ to promote sparsity across the model's $L$ layers:

$$L_1 \times \frac{1}{L}\sum_{l=1}^{L}\frac{1}{MN}\sum_{m=1}^{M}\sum_{n=1}^{N}\lvert h^{l}[m,n]\rvert.\qquad(2)$$

We note that many recent LLM architectures have deviated from using ReLUs in favor of smoother activation functions such as SiLU, with minor but consistent benefits (shazeer2020glu; llama2; qwen2). In Appendix C, we provide direct empirical comparisons between these choices and also refer to orthogonal studies in the recent literature showing that domain-specific performance differences can be bridged with targeted training techniques (mirzadeh2023relu_apple_finetune; lomeli2025stochasticactivations).
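The regularizer in Eq. (2) is just the mean absolute hidden activation, averaged over layers and scaled by the coefficient. A minimal NumPy sketch (our helper name `l1_sparsity_loss`; in practice this term is added to the cross-entropy inside the training framework):

```python
import numpy as np

def l1_sparsity_loss(hidden_states, l1_coeff):
    """Eq. (2): mean |h| per layer, averaged over the L layers,
    scaled by the tunable coefficient L1."""
    per_layer = [np.abs(h).mean() for h in hidden_states]  # 1/(MN) sum |h[m,n]|
    return l1_coeff * float(np.mean(per_layer))            # 1/L over layers

# Two toy "layers" of hidden activations h^l with M=2, N=2.
hs = [np.array([[1.0, -2.0], [0.0, 3.0]]), np.zeros((2, 2))]
loss = l1_sparsity_loss(hs, l1_coeff=2e-5)
# layer means are 1.5 and 0.0 -> average 0.75, scaled by 2e-5
assert np.isclose(loss, 1.5e-5)
```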
3 Making Sparse LLMs Fast
We introduce new CUDA kernels for inference and training that leverage unstructured sparsity to efficiently rework the computation in the feed-forward blocks of an LLM. The algorithms underlying our kernels build on TwELL, a new sparse format specifically designed for seamless kernel fusion to realize the inherent throughput and memory benefits of sparsity with minimal overheads. In this section, we describe the core components and advantages of our new kernels with algorithmic descriptions that summarize their logic at the level of individual cooperative thread arrays (CTAs). We refer to Appendix A for code listings and more detailed design discussions of the thread-level CUDA implementations for H100 GPUs.
Algorithm 1: Algorithmic description of gate projection with our matmul kernel with TwELL storage

1:  Parameters: tile sizes $T_m$, $T_n$; compression ratio $C$
2:  Input: dense $x \in \mathbb{R}^{M\times K}$, $W_g \in \mathbb{R}^{K\times N}$
3:  Output: sparse $h_v \in \mathbb{R}^{M\times N/C}$, $h_I \in \mathbb{N}^{M\times N/C}$, $h_{nz} \in \mathbb{N}^{M\times N_T}$
4:  for all tiles starting at $(m_0, n_0)$ in parallel across CTAs do
5:      $S \leftarrow x[m_0{:}m_0{+}T_m,\,:]\; W_g[:,\,n_0{:}n_0{+}T_n]$
6:      for $r \leftarrow 0 \dots T_m{-}1$ do
7:          $m \leftarrow m_0 + r$                 {global row index}
8:          $z \leftarrow 0$                       {running count of non-zeros in tile}
9:          for $c \leftarrow 0 \dots T_n{-}1$ do
10:             if $S[r,c] > 0$ then
11:                 $n \leftarrow n_0/C + z$       {global TwELL column index}
12:                 $h_I[m,n] \leftarrow n_0 + c$  {store non-zero index}
13:                 $h_v[m,n] \leftarrow S[r,c]$   {store non-zero value}
14:                 $z \leftarrow z + 1$           {increment non-zero count}
15:             end if
16:         end for
17:         $h_{nz}[m,\,n_0/T_n] \leftarrow z$     {store final count of non-zeros}
18:     end for
19: end for
Algorithm 2: Algorithmic description of fused up and down projections from gate activations in the TwELL format

1:  Parameters: tile size $T_n$; compression ratio $C$
2:  Input: sparse $h_v \in \mathbb{R}^{M\times N/C}$, $h_I \in \mathbb{N}^{M\times N/C}$, $h_{nz} \in \mathbb{N}^{M\times N_T}$; dense $x \in \mathbb{R}^{M\times K}$, $W_u \in \mathbb{R}^{K\times N}$, $W_d \in \mathbb{R}^{N\times K}$
3:  Output: dense $y \in \mathbb{R}^{M\times K}$
4:  for all $m \in \pi(0..M{-}1)$ in parallel across CTAs do
5:      $x_m \leftarrow x[m,:]$; $y_m \leftarrow 0$
6:      for $t \leftarrow 0 \dots N_T{-}1$ do
7:          $z \leftarrow h_{nz}[m,t]$
8:          for $c \leftarrow 0 \dots z{-}1$ do
9:              $n \leftarrow h_I[m,\,t\times T_n/C + c]$   {non-zero column index}
10:             $w_u \leftarrow W_u[:,n]$                   {$n$-th column of $W_u$}
11:             $u \leftarrow x_m \cdot w_u$                {sparse $h_u[m,n]$ element}
12:             $w_d \leftarrow W_d[n,:]$                   {$n$-th row of $W_d$}
13:             $y_m \leftarrow y_m + (h_v[m,\,t\times T_n/C + c] \times u)\, w_d$
14:         end for
15:     end for
16:     $y[m,:] \leftarrow y_m$
17: end for
3.1 Sparse Formats and Kernels
The ELLPACK format (ELL) is considered the state of the art for fast and efficient sparse matmuls (ellpack_original). This format was leveraged in some of the earliest GPU implementations of sparse algebra (first_sparse_gpu_impl), with more recent work focused on developing packing and sorting variants for better performance (sell_c_standard_on_gpus_sem; sell_c_standard_on_gpus_impl). As shown in part a. of Figure 1, an $M\times N$ matrix $h$ in the ELL format is stored as two padded matrices $h_v$ and $h_I$ of size $M\times N_{nz}$, with the non-zero values of $h$ and their column indices packed at the beginning of each row. This format prioritizes downstream usability over storage, padding the rows up to the maximum number of non-zero elements $N_{nz}$ for efficient retrieval.
The main logic in most matmul kernels performing $y = hW$ with ELL is to launch a parallel accumulation for each row $m = 0,\dots,M-1$ of the sparse matrix $h$ using a set number of threads. Each accumulation iterates for $j = 0,\dots,N_{nz}-1$, loading each column index $i = h_I[m,j]$ and value $v = h_v[m,j]$ of $h$ and multiplying the value with the $K$-dimensional row of the dense weight $W[i,:]$. The key advantage of this format is that only a fraction of the weight rows and input values need to be processed, skipping the remaining zeros. To further reduce data access and computation, later extensions such as ELLPACK-R (ell_nnzs_ellpack_r) also store the number of non-zeros in each row in a separate vector $h_{nz}$.
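This per-row ELL accumulation has a direct shape-level analogue in NumPy. The sketch below (our helper names `ell_pack` and `ell_matmul`, including the ELLPACK-R style non-zero counts) is a correctness reference, not the GPU kernel:

```python
import numpy as np

def ell_pack(h, n_nz_max):
    """Pack a sparse matrix h into ELL: per-row values/indices, zero-padded."""
    M, _ = h.shape
    h_v = np.zeros((M, n_nz_max))
    h_I = np.zeros((M, n_nz_max), dtype=int)
    h_nz = np.zeros(M, dtype=int)
    for m in range(M):
        cols = np.flatnonzero(h[m])
        h_v[m, :len(cols)] = h[m, cols]
        h_I[m, :len(cols)] = cols
        h_nz[m] = len(cols)
    return h_v, h_I, h_nz

def ell_matmul(h_v, h_I, h_nz, W):
    """y = h W, skipping zeros: one accumulation per row (ELLPACK-R style)."""
    M, K = h_v.shape[0], W.shape[1]
    y = np.zeros((M, K))
    for m in range(M):                 # parallel across rows on a GPU
        for j in range(h_nz[m]):       # only the non-zeros of row m
            y[m] += h_v[m, j] * W[h_I[m, j], :]
    return y

rng = np.random.default_rng(1)
h = rng.standard_normal((4, 16)) * (rng.random((4, 16)) < 0.2)  # ~80% zeros
W = rng.standard_normal((16, 8))
h_v, h_I, h_nz = ell_pack(h, n_nz_max=16)
assert np.allclose(ell_matmul(h_v, h_I, h_nz, W), h @ W)
```

The packing loop over every row element is precisely the step that, as discussed next, cannot be fused into a tiled matmul without cross-CTA synchronization.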
3.2 TwELL, a Sparse Data Format for Kernel Fusion
A predominant design principle for modern kernel pipelines is to maximize operator fusion and avoid unnecessary global memory accesses in order to best leverage the high compute throughput of modern NVIDIA GPUs. To this end, in a gated feed-forward block where sparsity patterns are determined by the gate activations $h_g$, prior sparse formats such as ELL suffer a major drawback. Representing $h_g$ with ELL requires first accessing all elements in every row to count, compare, and align the non-zero values and indices. However, existing matmul kernels for dense inputs rely on parallelizing computation across small 2D tiles $T_m \times T_n$ of the outputs, computed independently in separate CTAs. Thus, obtaining the gate activations directly in the ELL format from the non-sparse inputs cannot be done in the same kernel as $h_g = \operatorname{ReLU}(xW)$ without introducing expensive synchronization among different CTAs. Conversely, launching a separate kernel to do the conversion inherently introduces non-trivial overheads that concretely limit the attainable throughput gains of the whole computation.
To address these limitations, we introduce Tile-wise ELLPACK (TwELL). As illustrated in part b. of Figure 1, rather than focusing on whole rows, TwELL divides the columns of $h_g$ into groups of horizontal 1D tiles of size $T$. Within each group of columns, TwELL stores the non-zero values present and their indices in a local ELL-based packing, with the data of each row aligned at the beginning of each horizontal tile. This results in two matrices containing locally aligned values $h_v \in \mathbb{R}^{M\times N/C}$ and indices $h_I \in \mathbb{N}^{M\times N/C}$, where $C$ is a compression factor set such that $T/C$ is higher than the maximum number of non-zeros in any tile to avoid storage overflow. In our implementation of TwELL, we also store an additional matrix with the number of non-zero elements $h_{nz} \in \mathbb{N}^{M\times N_T}$ to facilitate further computations, with as many columns as total tiles $N_T = \lceil N/T \rceil$. While also inherently less expensive to derive, the main advantage of TwELL over ELL is its ease of materialization following a modern tiled matmul: by setting the horizontal tiling dimensions to match, $T = T_n$, the TwELL format can be recovered in the same kernel performing $h_g = \operatorname{ReLU}(xW)$ before the outputs are stored to DRAM. Fusing the two operations removes the need for additional kernel launches, memory reads, or synchronization steps, leading to a natural integration into existing LLM pipelines.
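The tile-local packing just described can be mirrored in NumPy for clarity. This is a shape-level sketch with our own helper name `twell_pack`, assuming for simplicity that $T$ divides $N$; it reproduces the layout of Algorithm 1's epilogue, not the fused CUDA implementation:

```python
import numpy as np

def twell_pack(h, T, C):
    """Pack h (M x N) into TwELL: within each horizontal tile of width T,
    non-zeros are aligned at the start of the tile's T/C packed slots."""
    M, N = h.shape
    n_tiles = N // T                       # assumes T divides N
    h_v = np.zeros((M, N // C))
    h_I = np.zeros((M, N // C), dtype=int)
    h_nz = np.zeros((M, n_tiles), dtype=int)
    for m in range(M):
        for t in range(n_tiles):
            cols = np.flatnonzero(h[m, t * T:(t + 1) * T]) + t * T
            assert len(cols) <= T // C, "tile overflow: pick a smaller C"
            base = t * T // C              # start of this tile's packed slots
            h_v[m, base:base + len(cols)] = h[m, cols]
            h_I[m, base:base + len(cols)] = cols
            h_nz[m, t] = len(cols)
    return h_v, h_I, h_nz

# Hand-built example: M=2, N=8, tile width T=4, compression C=2
# (so each tile keeps at most T/C = 2 non-zeros).
h = np.zeros((2, 8))
h[0, 1], h[0, 6], h[1, 4] = 5.0, 7.0, 3.0
h_v, h_I, h_nz = twell_pack(h, T=4, C=2)
assert h_nz.tolist() == [[1, 1], [0, 1]]
```

Because all bookkeeping is local to a tile, each output tile of the matmul can emit its slots independently, which is what makes the format recoverable in the epilogue without cross-CTA synchronization.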
3.3 Kernels for TwELL Construction and Fast, Fused Inference
In Algorithm 1, we provide pseudocode summarizing the logic of our CUDA matmul kernel storing the sparse outputs in the TwELL format (lines 6-18). Given the output distribution patterns of tensor core operations, we obtain the memory addresses to store the packed non-zero values $h_v$ and their indices $h_I$ by keeping a local non-zero count that only requires warp-level synchronization. While not an inherent requirement of TwELL, storing the number of non-zeros in each tile, $h_{nz}$, allows us to forego the overhead of initializing $h_I$ with any “padding” value and the additional control logic of checking validity in future usages. While omitted from Algorithm 1, we leverage fast asynchronous TMA reads and writes by first caching the dense inputs and sparse TwELL outputs in shared memory. We also pipeline computation and global memory accesses with a persistent cooperative design similar to the one in CUTLASS (nvidia-cutlass).
For inference, we introduce a single additional kernel to perform the rest of the computation in the feedforward block, leveraging the gate activations stored in the TwELL format to efficiently fuse the up and down projections together. This kernel, summarized in Algorithm 2, is launched on a grid made of single-warp CTAs, each processing a different row $m$ of the input activations $x$. Minimizing the size of each CTA serves to maximize concurrency and L2 hits across the grid, as non-zero activations tend to be highly correlated within input sequences. The fused matmuls are executed by traversing the sparsified activations with two nested for loops: the first statically unrolled over the number of column tiles (line 6) and the second dynamically iterating over the corresponding number of non-zeros in each tile (line 8). For each non-zero activation at index $n$, the CTA collectively loads the $n$-th column of $W_u$ and row of $W_d$ to perform a dot product, followed by a scalar-vector product and accumulation of its results (lines 9-13), corresponding to the following computation:
$$y[m,:] = \sum_{t=0}^{N_T-1}\;\sum_{c=0}^{h_{nz}[m,t]-1}\underbrace{h_v[m,\,t\times T_n/C+c]}_{h_v\text{ non-zero value}}\;\underbrace{\big(x[m,:]\cdot W_u[:,n]\big)}_{h_u\text{ element}}\;\underbrace{W_d[n,:]}_{W_d\text{ row}},\quad\text{where } n = \underbrace{h_I[m,\,t\times T_n/C+c]}_{h_I\text{ non-zero index}}.\qquad(3)$$

Implicitly materializing the $h_u$ values only inside the kernel serves to further reduce DRAM access and maximize throughput. Together, the kernels in our inference pipeline align the core principles of tiling and operator fusion into a single execution flow, harnessing the computational advantages of sparsity while minimizing its canonical overheads.
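Eq. (3) has a direct NumPy analogue that never stores $h_u$. The sketch below is a correctness reference with our own names (`twell_fused_ffn`, a handcrafted toy packing), not the single-warp CTA kernel:

```python
import numpy as np

def twell_fused_ffn(x, h_v, h_I, h_nz, W_u, W_d, T_n, C):
    """Eq. (3): for each packed non-zero gate activation, compute the
    matching h_u element on the fly and accumulate a scaled W_d row."""
    M, K = x.shape
    y = np.zeros((M, K))
    for m in range(M):                    # one CTA per row in the kernel
        for t in range(h_nz.shape[1]):    # loop over column tiles
            for c in range(h_nz[m, t]):   # non-zeros of this tile
                slot = t * T_n // C + c
                n = h_I[m, slot]          # non-zero column index
                u = x[m] @ W_u[:, n]      # implicit h_u[m, n], never stored
                y[m] += h_v[m, slot] * u * W_d[n, :]
    return y

# Handcrafted TwELL packing: M=1, K=3, N=4, T_n=2, C=2 (one slot per
# tile); the gate has non-zeros 0.5 at n=1 and 2.0 at n=2.
rng = np.random.default_rng(4)
x = np.array([[1.0, 2.0, 0.0]])
W_u, W_d = rng.standard_normal((3, 4)), rng.standard_normal((4, 3))
h_v, h_I, h_nz = np.array([[0.5, 2.0]]), np.array([[1, 2]]), np.array([[1, 1]])
y = twell_fused_ffn(x, h_v, h_I, h_nz, W_u, W_d, T_n=2, C=2)
h_g = np.array([[0.0, 0.5, 2.0, 0.0]])    # dense view of the gate
assert np.allclose(y, ((x @ W_u) * h_g) @ W_d)
```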
3.4 Hybrid Conversion for Efficient Storage
During training, memory becomes a key bottleneck for throughput, as large intermediate activations and optimizer states are needed for backpropagation. Here, sparsity provides a natural opportunity to tackle these bottlenecks by trivializing intermediate storage costs and accelerating gradient computations. However, directly using TwELL with a high compression ratio, or other ELL-based formats, requires the maximum number of non-zeros $N_{nz}$ to be known ahead of time and strictly small. As we will illustrate in Section 4, these conditions are practically never met during LLM training: sparsity patterns exhibit significant non-uniformity across different tokens, with the maximum number of non-zeros often orders of magnitude larger than the average.
We overcome these limitations by first converting the TwELL activations to a hybrid sparse format and introducing a new set of custom kernels designed specifically for memory-efficient training. As illustrated in part c. of Figure 1, our format dynamically partitions the rows of $h_g$, storing each either in an aggressively compact ELL matrix $h^s_g \in \mathbb{R}^{M^s\times N_{\hat{nz}}}$ or in a dense backup $h^d_g \in \mathbb{R}^{M^d\times N}$. The partitioning logic simply routes the rows of $h_g$ based on their non-zero counts, which are cheaply computed from the locally aligned TwELL tiles. Our hybrid format also maintains a lightweight array of column indices $h_I \in \mathbb{N}^{M^s\times N_{\hat{nz}}}$ matching the size of the sparsified ELL matrix, and a simple binary vector $h_b \in \{0,1\}^M$ indicating the storage location of each row. In practice, we find that we can set $N_{\hat{nz}}$ over an order of magnitude lower than $N$ with minimal overflow into $h^d_g$, avoiding stringent ELL requirements while still trivializing memory and computation during the rest of the training step.
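The routing step can be sketched as follows (NumPy; `hybrid_partition` and `n_nz_hat` are our illustrative names for the partitioning logic and the $N_{\hat{nz}}$ threshold):

```python
import numpy as np

def hybrid_partition(h, n_nz_hat):
    """Split rows of h by non-zero count: a compact ELL store for sparse
    rows, a dense backup for the rare heavy rows (part c. of Figure 1)."""
    counts = (h != 0).sum(axis=1)
    h_b = (counts > n_nz_hat).astype(int)      # 1 -> dense backup row
    sparse_rows = np.flatnonzero(h_b == 0)
    dense_rows = np.flatnonzero(h_b == 1)
    h_s = np.zeros((len(sparse_rows), n_nz_hat))
    h_I = np.zeros((len(sparse_rows), n_nz_hat), dtype=int)
    for i, m in enumerate(sparse_rows):
        cols = np.flatnonzero(h[m])
        h_s[i, :len(cols)] = h[m, cols]
        h_I[i, :len(cols)] = cols
    h_d = h[dense_rows]                        # dense backup matrix
    return h_s, h_I, h_d, h_b

h = np.zeros((3, 10))
h[0, [2, 7]] = [1.0, 4.0]    # 2 non-zeros -> sparse path
h[1, :] = 1.0                # heavy row -> dense backup
h[2, 5] = 2.0                # 1 non-zero -> sparse path
h_s, h_I, h_d, h_b = hybrid_partition(h, n_nz_hat=4)
assert h_b.tolist() == [0, 1, 0] and h_d.shape == (1, 10)
```

The dense backup acts as the safety valve: its capacity bounds worst-case behavior, so $N_{\hat{nz}}$ can be set aggressively small for the common case.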
3.5 Kernels for Lightweight Efficient Training
After materializing $h_g$ in our hybrid format from the pre-activations $xW_g$, we design custom kernels to perform efficient hybrid-to-dense and dense-to-hybrid matmuls. We directly use these kernels to execute the rest of the forward pass, computing $h_u = xW_u$ and $y = hW_d$. Unlike inference, during training we execute the up and down projections in separate steps, allowing us to efficiently store the sparsified hidden states and minimize recomputation in the backward pass. In Algorithm 3, we outline the logic of the hybrid-to-dense matmul for a generic input $h$ and weight $W$, with the dense-to-hybrid variant following the same general structure. Our approach combines a typical ELL kernel, with each CTA processing individual rows of the output $y$ (lines 4-13), and a traditional tiled kernel using tensor cores for the dense backup rows (lines 14-17). During the sparse portion of the matmul, we opt to statically unroll the accumulation up to the maximum number of non-zeros $N_{\hat{nz}}$ for each row. Moreover, we also statically pre-allocate the dense backup portions of all the activations based on the sparsity statistics observed during training. We note that these design choices introduce minimal extra computation and storage costs, which are largely offset by avoiding dynamic overheads.
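The hybrid-to-dense matmul of Algorithm 3 reduces to the following NumPy reference (our illustration with a hand-packed toy input; on the GPU the two loops run as separate parallel grids):

```python
import numpy as np

def hybrid_matmul(h_s, h_I, h_d, h_b, W):
    """y = h W from the hybrid format (Algorithm 3): per-row sparse
    accumulations for routed rows, a dense matmul for the backup."""
    M, N_out = len(h_b), W.shape[1]
    y = np.zeros((M, N_out))
    pi_s = np.flatnonzero(h_b == 0)       # rows stored sparsely
    pi_d = np.flatnonzero(h_b == 1)       # rows in the dense backup
    for ms, m in enumerate(pi_s):         # sparse portion (lines 4-13)
        for j in range(h_s.shape[1]):     # statically unrolled to N_nz_hat
            y[m] += h_s[ms, j] * W[h_I[ms, j], :]
    y[pi_d] = h_d @ W                     # dense backup (lines 14-17)
    return y

# Hand-packed example: rows 0 and 2 sparse (2 slots each), row 1 dense.
h = np.zeros((3, 6))
h[0, 1] = 2.0
h[1, :] = 1.0
h[2, [0, 4]] = [3.0, 5.0]
h_s = np.array([[2.0, 0.0], [3.0, 5.0]])  # zero-padded slots are harmless
h_I = np.array([[1, 0], [0, 4]])
h_d = h[1:2]
h_b = np.array([0, 1, 0])
W = np.arange(24, dtype=float).reshape(6, 4)
assert np.allclose(hybrid_matmul(h_s, h_I, h_d, h_b, W), h @ W)
```

Note that the zero padding in `h_s` contributes nothing to the accumulation, which is what makes the static unrolling over $N_{\hat{nz}}$ safe.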
During the backward pass, we retrieve the sparsified activations together with the L1 and output gradients $\nabla y$, allowing us to backpropagate without performing any expensive dense computation. This is achieved using two additional kernels that support efficient injection of L1 gradients into a given sparsity pattern and efficient transposition of our hybrid format for future coalesced accesses. We first use the stored sparsity pattern of $h$ to obtain its gradients via our efficient dense-to-hybrid matmul $\nabla y\, W^{\top}_{d}$, followed by the L1 injection. With $\nabla h$ available, we recover the rest of the input and weight gradients with direct applications of our hybrid-to-dense and transposed kernels:
$$\nabla h_u = \nabla h \odot h_g,\qquad \nabla h_g = \nabla h \odot h_u,\qquad(4)$$
$$\nabla W_u = x^{\top}\nabla h_u,\qquad \nabla W_g = x^{\top}\nabla h_g,\qquad \nabla W_d = h^{\top}\nabla y,\qquad \nabla x = \nabla h_u\,W_u^{\top} + \nabla h_g\,W_g^{\top}.$$

Crucially, our execution logic reflects a deliberate design choice: rather than aggressively fusing individual operators, the training kernel pipeline is structured around the training step in its entirety. In this setting, the hybrid format minimizes backward computation and memory overheads, allowing us to avoid dense calculations while remaining robust to the highly non-uniform sparsity patterns that make ELL-based approaches traditionally brittle.
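The backward rules of Eq. (4) can be exercised with a dense NumPy reference (our illustration, not the hybrid kernels). Since $h_g$ in Eq. (4) denotes the post-ReLU activation, the sketch also folds the ReLU derivative into $\nabla h_g$ to reach the pre-activation gradient needed for $\nabla W_g$ and $\nabla x$:

```python
import numpy as np

def ffn_backward(x, W_g, W_u, W_d, grad_y):
    """Dense reference of the Eq. (4) backward pass of the gated block."""
    h_u = x @ W_u
    pre_g = x @ W_g
    h_g = np.maximum(pre_g, 0)
    h = h_u * h_g
    grad_h = grad_y @ W_d.T                  # dense-to-hybrid matmul in ours
    grad_h_u = grad_h * h_g
    grad_h_g = grad_h * h_u * (pre_g > 0)    # ReLU derivative folded in
    grad_W_u = x.T @ grad_h_u
    grad_W_g = x.T @ grad_h_g
    grad_W_d = h.T @ grad_y
    grad_x = grad_h_u @ W_u.T + grad_h_g @ W_g.T
    return grad_x, grad_W_g, grad_W_u, grad_W_d

rng = np.random.default_rng(3)
x = rng.standard_normal((2, 3))
W_g, W_u = rng.standard_normal((3, 5)), rng.standard_normal((3, 5))
W_d = rng.standard_normal((5, 3))
grad_x, grad_W_g, grad_W_u, grad_W_d = ffn_backward(x, W_g, W_u, W_d, np.ones((2, 3)))
```

A finite-difference check against the scalar loss $\sum y$ confirms these gradients match the forward pass of Eq. (1).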
4 Experimental Results
Figure 2: Training curves of LLMs across L1 regularization levels.

Algorithm 3: Algorithmic description of matmul from input activations in the hybrid format

1:  Input: tile sizes $T_m, T_k, T_n$; $h := (h^s, h^d, h_I, h_b)$; $W \in \mathbb{R}^{K\times N}$
2:  Output: $y \in \mathbb{R}^{M\times N}$
3:  $\pi_s \leftarrow \{m : h_b[m] = 0\}$; $\pi_d \leftarrow \{m : h_b[m] = 1\}$
4:  for all $m_s \in 0..M^s{-}1$ in parallel do
5:      $m \leftarrow \pi_s[m_s]$                {global row index}
6:      $y_m \leftarrow 0$                       {row accumulator}
7:      for $j \leftarrow 0 \dots N_{\hat{nz}}{-}1$ do
8:          $n \leftarrow h_I[m_s, j]$           {non-zero column index}
9:          $v \leftarrow h^s[m_s, j]$           {sparse value}
10:         $y_m \leftarrow y_m + v \cdot W[n,:]$  {sparse row update}
11:     end for
12:     $y[m,:] \leftarrow y_m$
13: end for
14: for all tiles starting at $(m_0, n_0)$ in parallel do
15:     $S \leftarrow h^d[m_0{:}m_0{+}T_m,\,:]\; W[:,\,n_0{:}n_0{+}T_n]$
16:     $y[\pi_d[m_0{:}m_0{+}T_m],\, n_0{:}n_0{+}T_n] \leftarrow S$
17: end for
4.1 Training and Evaluation Settings
We provide quantitative results evaluating the performance and efficiency of LLMs at different sparsity levels and scales. Our models are based on the “Transformer++” architecture common to recent LLMs such as Qwen and Llama (llama2; qwen2), with the gated feedforward blocks described in Section 2. We train our models just above the Chinchilla-optimal number of tokens for each model size (chinchilla), using the FineWeb dataset (fineweb). We default to a context length of 2048, a batch size of 1M tokens, and the AdamW optimizer with a weight decay of 0.1 and a cosine schedule (adamw). Other hyperparameters, such as the hidden size and total number of layers, are based on the model size and chosen following modern practices (chinchilla). We note that our sparse models use the same training hyperparameters as our non-sparse baselines, as the addition of L1 regularization in the feedforward blocks did not seem to affect other choices.
To measure model performance, we use cross-entropy scores and seven common downstream tasks assessing logic and reasoning capabilities after pretraining (bench_1_arc; bench_2_hellaswag; bench_3_openbook_qa; bench_4_piqa; bench_5_winogrande; bench_6_commonsenseqa). In this section, we focus on aggregated performance metrics for conciseness and refer to Appendix D for the full per-task breakdowns. To measure efficiency, we analyze the throughput gains at different sparsity levels when integrating our training and inference kernels, recording execution times, memory requirements, and energy consumption at each stage. Across our experiments, we keep a fixed sequence length of 2048 and vary the micro batch size based on the available memory. Unless otherwise specified, we train and collect our results on a single node of eight H100 PCIe GPUs, a commonly available infrastructure in current listings of cloud providers and scientific clusters. We refer to Appendix B for further details on our training and evaluation settings, together with full hyperparameters specific to each of the considered model sizes.
Figure 3:Downstream accuracy and sparsity statistics of LLMs across L1 regularization levels.
Figure 4:Forward pass speedups and energy savings from our sparse LLM inference kernels across L1 regularization levels.
4.2 More Efficient LLMs with Unstructured Sparsity
Figure 5: Training speedups and peak memory reduction from our sparse LLM training kernels across L1 regularization levels.

Performance across sparsity levels. We start by evaluating the effect of introducing different levels of L1 regularization on the performance and sparsity of a 1.5B parameter model. In particular, we consider eight values for the $L_1$ coefficient, ranging from no regularization ($L_1 = 0$) to the point where less than a single neuron on average remains activated after training ($L_1 = 1\times 10^{-4}$). In Figure 2, we show the training curves of the different models, while in Figure 3 we report downstream task performance together with the final number of non-zero activations averaged across the feed-forward blocks. While our 1.5B model has a hidden feedforward dimensionality of 5632, we find that the non-regularized model already attains substantial sparsity, with only 911 neurons activated on average. Moreover, consistent with mirzadeh2023relu_apple_finetune, we find that introducing small levels of regularization already pushes the average number of non-zeros orders of magnitude lower, though with high variation across different tokens and layers. In particular, even at the highest regularization point, we find that a small fraction of tokens still excite several hundred neurons, indicating a reallocation of capacity. Despite this adaptivity, we do start seeing some performance degradation below 0.5% of activated neurons. Nonetheless, our results suggest that smaller levels of regularization do not visibly hinder capacity beyond the weight decay already induced by the AdamW optimizer: up until $L_1 = 3\times 10^{-5}$, we record essentially no drop in task performance and a negligible increase of final cross-entropy within 2% of the unregularized baseline.
Table 1: Comparison of performance and efficiency statistics of sparse LLMs leveraging our kernels with traditional models.

Leveraging sparsity for faster and lighter LLMs. We contrast our performance results by analyzing the efficiency improvements from integrating our kernels at different sparsity levels. In Figure 4, we provide the average relative speedups and total energy savings recorded during forward execution through our LLMs. Across all considered sparsity levels above $L_1 = 0$, we find that our inference kernels lead to visible throughput gains of up to 30%. These throughput gains are compounded by nearly 3% less GPU power draw above $L_1 = 3\times 10^{-5}$, resulting in even higher energy savings. In Figure 5, we also show the average relative speedups and peak memory reduction with our training kernels. In line with our inference kernels, the speedups recorded throughout training increase significantly, up to 24% with sparser models. Furthermore, the peak GPU memory required for training decreases by more than 24% even at the lowest considered sparsity level, reducing hardware barriers for efficient training at billion-parameter scales (we refer to Appendix D for results on an RTX 6000). Taken together, we believe our results provide compelling evidence that specialized, targeted kernels can make sparsity a viable new axis for the design of modern LLMs, leading to significant efficiency improvements across their full lifecycle.
Sparsity across model scales. We analyze how model scale affects the performance and efficiency of sparse LLMs. For this analysis, we set L1 = 2×10^-5 based on our earlier results on the 1.5B model, which we recommend as a conservative threshold to avoid any significant performance degradation. In Table 1, we compare the performance and efficiency of sparse and non-sparse LLMs at the Chinchilla-optimality boundary, ranging from a 0.5B model trained on 10B tokens to a 2B model trained on 40B tokens. Consistent with our earlier results, we find no performance drops beyond random deviations at any scale when mild L1 regularization is introduced. Furthermore, we find that LLMs become increasingly effective at supporting sparsity at larger scales, resulting in a lower average number of non-zero elements (from 39 to 24, going from the 0.5B to the 2B model). In turn, this makes all the aforementioned throughput and memory benefits of our kernels grow: the 2B sparse model processes tokens 20.5% faster during inference and trains 21.9% more efficiently with a larger micro-batch size. These findings suggest that sparsity aligns well with prevailing scaling trends, highlighting its growing potential relevance for future model development.
4.3The Properties of Sparse Large Language Models
Figure 6: Sparsity statistics and speedup contributions across different layers of our sparse LLMs.

We analyze how LLMs allocate sparsity across their layers and batched samples. For our analysis, we collect the activations from 2^20 input tokens using our 1.5B model trained with the suggested performance-preserving L1 = 2×10^-5. We complement this subsection with additional results in Appendix D, looking at additional levels of L1 regularization together with how sparsity evolves throughout training and its effects on dead neurons.
Sparsity and model depth. Figure 6 examines activations across model depth, relating the non-zero statistics of each layer to its respective contribution to inference speed-ups. While the average number of non-zeros across all layers is less than 30, the figure highlights clear variations in sparsity both across and within individual layers. In particular, the first two layers are the least active, followed by a pronounced hump in the average number of non-zeros across the first half of the network. This sparsity pattern, peaking during the early-middle layers, appears consistent with prior work suggesting that a substantial portion of an LLM's reasoning and knowledge retrieval occurs precisely at these depths (llamas_eng_an). Furthermore, within each layer, the maximum number of non-zeros often exceeds the layer's mean by more than an order of magnitude and shows no consistent pattern across the architecture. We also observe an intuitive and pronounced inverse correlation between each layer's average non-zeros and its relative speed-up, with a Pearson coefficient below -0.996. In contrast, the maximum activation counts have a more limited effect on inference speedups, noticeable only in layer 8. This robustness is due to our kernel design, which hides the latency of highly activated tokens through maximally parallelized execution.
Figure 7: Sparsity statistics across LLM input tokens and positions.

Sparsity and input properties. Given the high level of unevenness across activations, we analyze which inputs spur the peaks and troughs in non-zero activation counts. In part a. of Figure 7, we identify the common tokens with the six lowest and highest average numbers of non-zeros, filtering out outliers occurring at a frequency lower than 1/2^14. We find that the tokens with the lowest non-zero activity often represent parts of common web links (doi, nlm, gov, nih) or contractions (doesn, couldn) that precede predictable next tokens in crawled web corpora. In contrast, tokens providing important contextual information about a passage have the highest activity, including particular descriptive terms (loud, enduring) or nouns representing specific locations or substances (Vermont, Greeks, formaldehyde, ACH). In part b. of Figure 7, we then plot how the average number of non-zeros varies with token position in the input sequence on a log-log scale. Interestingly, we find that the LLM allocates a much greater number of non-zeros to the very first tokens in a sequence, with an exponential decrease thereafter. Intuitively, these results indicate that LLMs focus their computational effort on tokens with high information content and on sequence positions where contextual cues from prior tokens are missing. Here, introducing sparsity not only provides an interpretable lens on model behavior, but also enables our kernels to leverage this inherent information unevenness for significant training and inference speedups.
5Related Work
The emergence of sparsity in LLMs with ReLU activations and its theoretical benefits have been repeatedly documented in earlier work (li2023lazyneuronphenomenonemergence; mirzadeh2023relu_apple_finetune). Since then, more recent methods have been proposed to enhance sparsity by altering modern gated architectures, claiming speedups from running sparse feed-forward layers in isolation on older generations of devices. TurboSparse (song2024turbo) studies boosting sparsity via repeated ReLU non-linearities, while ProSparse (song2024prosparse) finetunes pretrained models by manually thresholding activations. Q-Sparse (q_sparse_top_k_related) further deviates from standard models by using a straight-through estimator and retaining only the top-K activations. Other work has instead focused on introducing structured sparsity post-training, such as by predicting (dejavu_contectual_sparsity_related) and pruning activations to accelerate computation (cats_post_training_thresh_related; teal_post_training_thresh_related). Unlike these efforts, our paper introduces general-purpose kernels to leverage unstructured sparsity, demonstrating empirical efficiency benefits during both LLM inference and training. We refer to Appendix E for an extended overview of prior work that more fundamentally reshapes architecture design.
6Discussion and Future Work
In this work, we leverage unstructured sparsity to lessen the computational burdens of modern LLMs. For inference, we design a new sparse format and fused operations to efficiently execute whole gated feed-forward blocks in only two kernel launches, minimizing global memory accesses and computation. For training, we introduce a new hybrid algorithm that dynamically schedules computation on both CUDA and Tensor cores, while trivializing the storage costs of intermediate activations for backpropagation. We demonstrate that mild L1 regularization induces considerable levels of sparsity with negligible impact on downstream performance, which our kernels translate into significant gains in throughput, energy efficiency, and memory footprint at billion-parameter scales. While our work serves as a concrete demonstration of the benefits of sparse LLMs, there are numerous exciting avenues for future extensions. For instance, in Appendix C, we provide preliminary results indicating that the performance of highly sparse LLMs can be further improved with strategies targeted at dead-neuron mitigation. Moreover, fine-tuning existing dense models via recent sparsification approaches (mirzadeh2023relu_apple_finetune; song2024prosparse) would allow bringing the benefits of our kernels to the vast library of pretrained LLMs available in the wild. By sharing our kernels, we hope our work will help promote sparsity as a new design axis for efficiency, ultimately reducing the growing energy and hardware costs of large-scale foundation models.
Author contribution
Edoardo Cetin conceived the TwELL format, led the implementation and design of the CUDA kernels using TwELL, led model training and benchmarking, and made contributions to writing.
Stefano Peluchetti did early work on sparse model training, advised the project, and made contributions to writing.
Emilio Castillo conceived the hybrid format, co-led the implementation of the CUDA kernels, designed the training extensions, made contributions to kernel benchmarking, and made contributions to writing.
Akira Naruse advised the project, was involved in early discussions about method design, and worked on early implementations of the sparse kernels.
Mana Murakami was involved in early discussions about method design.
Llion Jones did initial explorations of sparse model training, advised the project, was involved in early discussions about method design, and made contributions to writing.
References
Appendix AKernels Implementation Details
A.1Inference Kernels Selection
template<const int T_m, const int T_n, const int T_k>
struct Tiles
{
    alignas(128) __nv_bfloat16 a[T_m][T_k];
    alignas(128) __nv_bfloat16 b[T_n][T_k];
};

template<
    const int T_m,
    const int T_n,
    const int T_k,
    const int QUEUE_SIZE,
    const int T_n_compressed,
    int PADDING = 4
>
struct SmemStorage
{
    Tiles<T_m, T_n, T_k> queue[QUEUE_SIZE];
    alignas(128) uint32_t c_packed[T_m][T_n_compressed + PADDING];
};

template<
    const int T_m,
    const int T_n,
    const int T_k,
    const int CLUSTER_DIM_m,
    const int CLUSTER_DIM_n,
    const int QUEUE_SIZE,
    const int NUM_ACTIVE_SMs,
    const int T_n_compressed,
    const bool LOOP_OVERFLOW_STORAGE
>
__global__ __launch_bounds__(NUM_THREADS_PER_BLOCK)
__cluster_dims__(CLUSTER_DIM_m * CLUSTER_DIM_n, 1, 1)
void mm_wgmma_nt_kernel(
    const CUtensorMap __grid_constant__ A_tm,
    const CUtensorMap __grid_constant__ B_tm,
    const CUtensorMap __grid_constant__ C_packed_tm,
    const int* schedule_gmem_ptr,
    const int schedule_size_per_sm,
    const int K
)
{
    static_assert(
        (T_m == 64 * 2),
        "Only T_m == 128 supported"
    );

    constexpr int CLUSTER_SIZE = CLUSTER_DIM_m * CLUSTER_DIM_n;
    extern __shared__ __align__(1024) unsigned char dynamic_smem[];

    int cluster_idx;
    asm("mov.u32 %

    int cluster_lane_m;
    asm volatile("mov.u32 %

    int cluster_lane_n = cluster_lane_m %
    cluster_lane_m /= CLUSTER_DIM_n;

    auto& tiles_s =
        *reinterpret_cast<
            SmemStorage<T_m, T_n, T_k, QUEUE_SIZE, T_n_compressed>*
        >(dynamic_smem);
    int* schedule_s = reinterpret_cast<int*>(
        dynamic_smem
        + sizeof(SmemStorage<T_m, T_n, T_k, QUEUE_SIZE, T_n_compressed>)
    );

    schedule_gmem_ptr += cluster_idx * schedule_size_per_sm;
    if (threadIdx.x < schedule_size_per_sm) {
        schedule_s[threadIdx.x] = schedule_gmem_ptr[threadIdx.x];
    }

    __syncthreads();

    __shared__ __align__(8) uint64_t queue_full[QUEUE_SIZE];
    __shared__ __align__(8) uint64_t queue_empty[QUEUE_SIZE];

    if (threadIdx.x == 0) {
        #pragma unroll
        for (int queue_idx = 0; queue_idx < QUEUE_SIZE; ++queue_idx) {
            ptx_init_smem_barrier(&queue_full[queue_idx], 1);
            ptx_init_smem_barrier(&queue_empty[queue_idx], 2 * CLUSTER_SIZE);
        }
    }

    asm volatile("barrier.cluster.arrive;\n" ::);
    asm volatile("barrier.cluster.wait;\n" ::);

    if (threadIdx.x < WARP_GROUP_SIZE) {
        asm volatile("setmaxnreg.dec.sync.aligned.u32 %

        if (threadIdx.x == 0) {
            int queue_idx = 0;
            int queue_phase = 0;
            uint16_t mask_multicast_m = 0;

            if constexpr (CLUSTER_DIM_m > 1) {
                for (int i = 0; i < CLUSTER_DIM_m; ++i) {
                    mask_multicast_m |= (1u << (i * CLUSTER_DIM_n));
                }
                mask_multicast_m <<= cluster_lane_n;
            }

            uint16_t mask_multicast_n;
            if constexpr (CLUSTER_DIM_n > 1) {
                mask_multicast_n =
                    ((1u << CLUSTER_DIM_n) - 1)
                    << (cluster_lane_m * CLUSTER_DIM_n);
            }

            for (int schedule_it = 0;
                 schedule_it < schedule_size_per_sm;
                 ++schedule_it) {
                const int packed_tile = schedule_s[schedule_it];
                if (packed_tile == -1) {
                    break;
                }

                int tile_coord_m = packed_tile >> 16;
                int tile_coord_n = packed_tile & 0xFFFF;

                if constexpr (CLUSTER_DIM_n > 1) {
                    tile_coord_n *= CLUSTER_DIM_n;
                    tile_coord_n += cluster_lane_n;
                }
                if constexpr (CLUSTER_DIM_m > 1) {
                    tile_coord_m *= CLUSTER_DIM_m;
                    tile_coord_m += cluster_lane_m;
                }

                for (int tile_start_k = 0;
                     tile_start_k < K;
                     tile_start_k += T_k, ++queue_idx) {
                    if (queue_idx == QUEUE_SIZE) {
                        queue_idx = 0;
                        queue_phase ^= 1;
                    }

                    ptx_wait_barrier(&queue_empty[queue_idx], queue_phase);
                    ptx_arrive_tx_smem_barrier(
                        &queue_full[queue_idx],
                        sizeof(tiles_s.queue[queue_idx].a)
                        + sizeof(tiles_s.queue[queue_idx].b)
                    );

                    if constexpr (CLUSTER_DIM_n > 1) {
                        if (cluster_lane_n == 0) {
                            ptx_load_tile_tma_multicast_2d(
                                &tiles_s.queue[queue_idx].a[0][0],
                                &A_tm,
                                tile_coord_m * T_m,
                                tile_start_k,
                                mask_multicast_n,
                                &queue_full[queue_idx]
                            );
                        }
                    } else {
                        ptx_load_tile_tma_2d(
                            &tiles_s.queue[queue_idx].a[0][0],
                            &A_tm,
                            tile_coord_m * T_m,
                            tile_start_k,
                            &queue_full[queue_idx]
                        );
                    }

                    if constexpr (CLUSTER_DIM_m > 1) {
                        if (cluster_lane_m == 0) {
                            ptx_load_tile_tma_multicast_2d(
                                &tiles_s.queue[queue_idx].b[0][0],
                                &B_tm,
                                tile_coord_n * T_n,
                                tile_start_k,
                                mask_multicast_m,
                                &queue_full[queue_idx]
                            );
                        }
                    } else {
                        ptx_load_tile_tma_2d(
                            &tiles_s.queue[queue_idx].b[0][0],
                            &B_tm,
                            tile_coord_n * T_n,
                            tile_start_k,
                            &queue_full[queue_idx]
                        );
                    }
                }
            }
        }
    } else {
        asm volatile("setmaxnreg.inc.sync.aligned.u32 %
        int queue_idx = 0;
        int queue_phase = 0;
        const int consumer_warpgroup_id =
            (threadIdx.x - WARP_GROUP_SIZE) / WARP_GROUP_SIZE;
        const int tile_start_m = consumer_warpgroup_id * WGMMA_m;
        const int consumer_thread_id = threadIdx.x %
        const uint thread_lane_idx_n = (consumer_thread_id %

        const int thread_store_offset_m = (
            tile_start_m
            + consumer_thread_id / 32 * 16
            + (consumer_thread_id %
        );
        const int thread_store_offset_n =
            ((consumer_thread_id %

        if (consumer_thread_id < CLUSTER_SIZE) {
            for (int queue_idx = 0; queue_idx < QUEUE_SIZE; ++queue_idx) {
                ptx_arrive_barrier_across_cluster(
                    &queue_empty[queue_idx],
                    consumer_thread_id,
                    1
                );
            }
        }

        float C_accum[T_n / 16][8];
        for (int schedule_it = 0;
             schedule_it < schedule_size_per_sm;
             ++schedule_it) {
            const int packed_tile = schedule_s[schedule_it];
            if (packed_tile == -1) {
                break;
            }

            int tile_coord_m = packed_tile >> 16;
            int tile_coord_n = packed_tile & 0xFFFF;

            if constexpr (CLUSTER_DIM_n > 1) {
                tile_coord_n *= CLUSTER_DIM_n;
                tile_coord_n += cluster_lane_n;
            }
            if constexpr (CLUSTER_DIM_m > 1) {
                tile_coord_m *= CLUSTER_DIM_m;
                tile_coord_m += cluster_lane_m;
            }

            if (queue_idx == QUEUE_SIZE) {
                queue_idx = 0;
                queue_phase ^= 1;
            }

            ptx_wait_barrier(&queue_full[queue_idx], queue_phase);
            asm volatile("wgmma.fence.sync.aligned;" ::: "memory");

            wgmma<T_n, 0, 1, 1, 0, 0>(
                C_accum,
                &tiles_s.queue[queue_idx].a[tile_start_m][0],
                &tiles_s.queue[queue_idx].b[0][0]
            );

            #pragma unroll
            for (int wgmma_start_k = WGMMA_k;
                 wgmma_start_k < T_k;
                 wgmma_start_k += WGMMA_k) {
                wgmma<T_n, 1, 1, 1, 0, 0>(
                    C_accum,
                    &tiles_s.queue[queue_idx].a[tile_start_m][wgmma_start_k],
                    &tiles_s.queue[queue_idx].b[0][wgmma_start_k]
                );
            }

            asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
            asm volatile("wgmma.wait_group.sync.aligned %

            if (consumer_thread_id < CLUSTER_SIZE) {
                ptx_arrive_barrier_across_cluster(
                    &queue_empty[queue_idx],
                    consumer_thread_id,
                    1
                );
            }

            queue_idx++;

            for (int tile_idx_k = 1;
                 tile_idx_k < K / T_k;
                 ++tile_idx_k, ++queue_idx) {
                if (queue_idx == QUEUE_SIZE) {
                    queue_idx = 0;
                    queue_phase ^= 1;
                }

                ptx_wait_barrier(&queue_full[queue_idx], queue_phase);
                asm volatile("wgmma.fence.sync.aligned;" ::: "memory");

                #pragma unroll
                for (int wgmma_start_k = 0;
                     wgmma_start_k < T_k;
                     wgmma_start_k += WGMMA_k) {
                    wgmma<T_n, 1, 1, 1, 0, 0>(
                        C_accum,
                        &tiles_s.queue[queue_idx].a[tile_start_m][wgmma_start_k],
                        &tiles_s.queue[queue_idx].b[0][wgmma_start_k]
                    );
                }

                asm volatile("wgmma.commit_group.sync.aligned;" ::: "memory");
                asm volatile("wgmma.wait_group.sync.aligned %

                if (consumer_thread_id < CLUSTER_SIZE) {
                    ptx_arrive_barrier_across_cluster(
                        &queue_empty[queue_idx],
                        consumer_thread_id,
                        1
                    );
                }
            }

            {
                asm volatile("cp.async.bulk.wait_group.read 0;\n");
                if (thread_lane_idx_n <= 1) {
                    tiles_s.c_packed[
                        thread_store_offset_m + 8 * thread_lane_idx_n
                    ][0] = 0;
                }
                __syncwarp();

                #pragma unroll
                for (int quadrant_slice_m = 0;
                     quadrant_slice_m < 4;
                     quadrant_slice_m += 2) {
                    int quadrant_store_offset_m =
                        thread_store_offset_m + quadrant_slice_m * 4;

                    #pragma unroll
                    for (int wgmma_slice_n = 0;
                         wgmma_slice_n < T_n / 16;
                         ++wgmma_slice_n) {
                        int quadrant_store_offset_n =
                            thread_store_offset_n + wgmma_slice_n * 16;

                        #pragma unroll
                        for (int quadrant_slice_n = 0;
                             quadrant_slice_n < 8;
                             quadrant_slice_n += 4) {
                            #pragma unroll
                            for (int element_n = 0;
                                 element_n < 2;
                                 ++element_n) {
                                if (
                                    C_accum[wgmma_slice_n][
                                        quadrant_slice_m
                                        + quadrant_slice_n
                                        + element_n
                                    ] > 0
                                ) {
                                    uint current_store_idx = __nv_atomic_fetch_add(
                                        &tiles_s.c_packed[quadrant_store_offset_m][0],
                                        1u,
                                        __NV_ATOMIC_RELAXED,
                                        __NV_THREAD_SCOPE_BLOCK
                                    );

                                    const uint32_t packed_value =
                                        tile_coord_n * T_n
                                        + quadrant_store_offset_n
                                        + quadrant_slice_n * 2
                                        + element_n
                                        | (
                                            static_cast<uint32_t>(
                                                __bfloat16_as_ushort(
                                                    __float2bfloat16(
                                                        C_accum[wgmma_slice_n][
                                                            quadrant_slice_m
                                                            + quadrant_slice_n
                                                            + element_n
                                                        ]
                                                    )
                                                )
                                            ) << 16
                                        );

                                    if constexpr (LOOP_OVERFLOW_STORAGE) {
                                        tiles_s.c_packed[quadrant_store_offset_m][
                                            (current_store_idx & (T_n_compressed - 1)) + 1
                                        ] = packed_value;
                                    } else {
                                        tiles_s.c_packed[quadrant_store_offset_m][
                                            current_store_idx + 1
                                        ] = packed_value;
                                    }
                                }
                            }
                        }
                    }
                }

                asm volatile("fence.proxy.async.shared::cta;\n");
                asm volatile("bar.sync 10, 256;\n");

                if (threadIdx.x == 128) {
                    ptx_store_transposed_tile_tma_3d<uint32_t, T_n_compressed>(
                        &C_packed_tm,
                        &tiles_s.c_packed[0][0],
                        tile_coord_m * T_m,
                        tile_coord_n * T_n_compressed
                    );
                    asm volatile("cp.async.bulk.commit_group;\n");
                }
            }
        }
    }
}
Listing 1: Efficient matrix multiplication with TwELL output storage.

In Listing 1, we provide the device code for our kernel implementing a custom matmul with our new TwELL storage, which we use to run the gate projection in our model. We omit device functions wrapping longer PTX injections for readability. As explained in Section 3 of the main text, this kernel executes an efficient tiled matrix multiplication, loading the dense input and weight matrices and storing the output values using the Tensor Memory Accelerator (TMA) introduced with Hopper GPUs, while packing the output into the TwELL format during the kernel's epilogue. The base kernel follows a persistent design with pipelined computation based on persistent cooperative kernels in CUTLASS [nvidia-cutlass] and open-source CUDA reproductions [hilbert-blog]. Unlike CUTLASS, the tile scheduler follows a pre-constructed ordering based on a Hilbert curve to maximize reuse of the L2 cache [hilbert-paper, hilbert-blog]. In practice, we opt to pack the TwELL values h, indices h_I, and number of non-zeros h_nz in a single 32-bit matrix in R^{M x N/C}. This is done by placing the number of non-zeros for each tile row in the first column and fitting the 16-bit TwELL value and index in the remaining entries. This design ensures strong locality and allows storing and loading the number of non-zeros together with the first 31 TwELL indices and values in a single coalesced access. While this loses a storage position, we set the TwELL compression factors very conservatively for each sparsity level, making the occurrence of overflow practically impossible. For instance, we set the compression factor to 8 for the recommended sparsity regularization studied in our main results, with models ranging from 39 to 24 average non-zeros and an expected chance of overflow on the order of 10^-34.
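The 32-bit packing described above can be sketched on the host side as follows (a readability sketch with illustrative helper names; the bf16 conversion here is a plain truncation of the float's upper bits, whereas the device code uses the rounding intrinsics):

```cpp
#include <cstdint>
#include <cstring>

// One TwELL entry: bf16 activation value in the high 16 bits,
// column index in the low 16 bits, mirroring the epilogue of Listing 1.
inline std::uint32_t twell_pack(float value, std::uint16_t col) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    const std::uint16_t bf16 = static_cast<std::uint16_t>(bits >> 16);
    return (static_cast<std::uint32_t>(bf16) << 16) | col;
}

inline std::uint16_t twell_col(std::uint32_t entry) { return entry & 0xFFFFu; }

inline float twell_value(std::uint32_t entry) {
    const std::uint32_t bits = (entry >> 16) << 16;  // bf16 bits back to float
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}
```

Because value and index share one word, a warp reading 32 consecutive entries obtains the non-zero count plus 31 index/value pairs in a single coalesced transaction.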
The TwELL conversion occurs when mapping the partial outputs of the asynchronous warpgroup level matmul instructions (WGMMA) from registers to shared memory via a fast CTA-scoped atomic operation with relaxed semantics. To avoid bank conflicts when resetting the number of non-zeros, we minimally pad the TwELL output with four extra elements in the last dimension. In this instance, we found our padding approach to work significantly better than swizzling, due to the lower register pressure introduced in the kernel’s epilogue. In an alternative implementation, we also explored a different packing layout, placing the number of non-zeros across the diagonal of the first 32-dimensional subtile to cover all memory banks, an approach we found brought minimal throughput improvements at the cost of extra complexity. We note that for the non-gated variant of our models, we use this same kernel to perform the up projection, as this layer is the one determining the overall sparsity pattern of the tile.
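To give a sense of why tile overflow is practically impossible at the stated compression factors, consider an order-of-magnitude estimate (our simplifying assumption, not the paper's exact model: non-zeros fall independently and uniformly across columns). The chance that a 256-column tile row receives more than 31 non-zeros, when ~30 non-zeros are spread over 5632 columns, is a binomial tail probability:

```cpp
#include <cmath>

// Tail probability P(X > cap) for X ~ Binomial(n, p), computed via
// lgamma in log-space for numerical stability. Used only as a
// sanity check on the overflow-probability claim above.
double binom_tail(int n, double p, int cap) {
    double total = 0.0;
    for (int k = cap + 1; k <= n; ++k) {
        const double log_pmf = std::lgamma(n + 1.0) - std::lgamma(k + 1.0)
                             - std::lgamma(n - k + 1.0)
                             + k * std::log(p) + (n - k) * std::log1p(-p);
        total += std::exp(log_pmf);
    }
    return total;
}
```

Under this independence assumption, `binom_tail(256, 30.0 / 5632.0, 31)` comes out vanishingly small (well below 1e-30), consistent with the claim that overflow essentially never occurs at compression factor 8.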
template<
    const int T_n,
    const int T_n_compressed,
    const int NUM_T_n,
    const int OUT_DIM
>
__global__ __launch_bounds__(WARP_SIZE)
void mm_t2d_kernel(
    const __nv_bfloat16* IN_d,
    const uint32_t* GATE_OUT_twell_packed_d,
    const __nv_bfloat16* UP_transposed_d,
    const __nv_bfloat16* DOWN_d,
    __nv_bfloat16* OUT_d
)
{
    static_assert(
        (OUT_DIM % WARP_SIZE) == 0,
        "OUT_DIM must be divisible by WARP_SIZE."
    );
    static_assert(T_n_compressed == WARP_SIZE,
        "Warp-sync TwELL-to-dense assumes a 32-wide compressed tile.");

    constexpr int NUM_LOAD_ITERS = OUT_DIM / STRIDE_8xWARP;
    float OUT_accum[NUM_LOAD_ITERS][8] = {0.0f};
    __nv_bfloat162 IN_cached[NUM_LOAD_ITERS][4];

    IN_d += blockIdx.x * OUT_DIM + threadIdx.x * 8;
    GATE_OUT_twell_packed_d += (
        blockIdx.x * T_n_compressed * NUM_T_n + threadIdx.x
    );
    UP_transposed_d += threadIdx.x * 8;
    DOWN_d += threadIdx.x * 8;
    OUT_d += blockIdx.x * OUT_DIM + threadIdx.x * 8;

    #pragma unroll
    for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
        *reinterpret_cast<uint4*>(&IN_cached[iter_idx][0]) =
            *reinterpret_cast<const uint4*>(
                IN_d + iter_idx * STRIDE_8xWARP
            );
    }

    #pragma unroll 1
    for (int tile_idx = 0; tile_idx < NUM_T_n; ++tile_idx) {
        const int lane_tile_register =
            GATE_OUT_twell_packed_d[tile_idx * T_n_compressed];
        const int num_nonzeros =
            __shfl_sync(0xFFFFFFFFu, lane_tile_register, 0);

        #pragma unroll 1
        for (int idx = 1; idx < num_nonzeros + 1; ++idx) {
            const uint32_t compressed_idx_bf16 =
                __shfl_sync(0xFFFFFFFFu, lane_tile_register, idx);
            const uint32_t nonzero_idx = compressed_idx_bf16 & 0xFFFFu;

            float UP_OUT_accum = 0.0f;

            #pragma unroll
            for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
                const uint4 packed_bfloats_x8 =
                    *reinterpret_cast<const uint4*>(
                        UP_transposed_d
                        + nonzero_idx * OUT_DIM
                        + iter_idx * STRIDE_8xWARP
                    );
                const __nv_bfloat162 packed_bfloats_1 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.x
                    );
                __nv_bfloat162 scaled_bfloats_1 =
                    __hmul2(IN_cached[iter_idx][0], packed_bfloats_1);
                float2 scaled_floats_1 = __bfloat1622float2(scaled_bfloats_1);
                UP_OUT_accum += scaled_floats_1.x + scaled_floats_1.y;

                const __nv_bfloat162 packed_bfloats_2 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.y
                    );
                __nv_bfloat162 scaled_bfloats_2 =
                    __hmul2(IN_cached[iter_idx][1], packed_bfloats_2);
                float2 scaled_floats_2 = __bfloat1622float2(scaled_bfloats_2);
                UP_OUT_accum += scaled_floats_2.x + scaled_floats_2.y;

                const __nv_bfloat162 packed_bfloats_3 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.z
                    );
                __nv_bfloat162 scaled_bfloats_3 =
                    __hmul2(IN_cached[iter_idx][2], packed_bfloats_3);
                float2 scaled_floats_3 = __bfloat1622float2(scaled_bfloats_3);
                UP_OUT_accum += scaled_floats_3.x + scaled_floats_3.y;

                const __nv_bfloat162 packed_bfloats_4 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.w
                    );
                __nv_bfloat162 scaled_bfloats_4 =
                    __hmul2(IN_cached[iter_idx][3], packed_bfloats_4);
                float2 scaled_floats_4 = __bfloat1622float2(scaled_bfloats_4);
                UP_OUT_accum += scaled_floats_4.x + scaled_floats_4.y;
            }

            #pragma unroll
            for (int butterfly_stride = WARP_SIZE / 2;
                 butterfly_stride > 0;
                 butterfly_stride /= 2) {
                UP_OUT_accum += __shfl_xor_sync(
                    0xFFFFFFFFu,
                    UP_OUT_accum,
                    butterfly_stride
                );
            }

            const __nv_bfloat162 nonzero_feature =
                __bfloat162bfloat162(
                    __hmul(
                        reinterpret_cast<const __nv_bfloat16*>(
                            &compressed_idx_bf16
                        )[1],
                        __float2bfloat16_rn(UP_OUT_accum)
                    )
                );

            #pragma unroll
            for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
                const uint4 packed_bfloats_x8 =
                    *reinterpret_cast<const uint4*>(
                        DOWN_d
                        + nonzero_idx * OUT_DIM
                        + iter_idx * STRIDE_8xWARP
                    );
                const __nv_bfloat162 packed_bfloats_1 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.x
                    );
                __nv_bfloat162 scaled_bfloats_1 =
                    __hmul2(nonzero_feature, packed_bfloats_1);
                float2 scaled_floats_1 = __bfloat1622float2(scaled_bfloats_1);
                OUT_accum[iter_idx][0] += scaled_floats_1.x;
                OUT_accum[iter_idx][1] += scaled_floats_1.y;

                const __nv_bfloat162 packed_bfloats_2 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.y
                    );
                __nv_bfloat162 scaled_bfloats_2 =
                    __hmul2(nonzero_feature, packed_bfloats_2);
                float2 scaled_floats_2 = __bfloat1622float2(scaled_bfloats_2);
                OUT_accum[iter_idx][2] += scaled_floats_2.x;
                OUT_accum[iter_idx][3] += scaled_floats_2.y;

                const __nv_bfloat162 packed_bfloats_3 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.z
                    );
                __nv_bfloat162 scaled_bfloats_3 =
                    __hmul2(nonzero_feature, packed_bfloats_3);
                float2 scaled_floats_3 = __bfloat1622float2(scaled_bfloats_3);
                OUT_accum[iter_idx][4] += scaled_floats_3.x;
                OUT_accum[iter_idx][5] += scaled_floats_3.y;

                const __nv_bfloat162 packed_bfloats_4 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.w
                    );
                __nv_bfloat162 scaled_bfloats_4 =
                    __hmul2(nonzero_feature, packed_bfloats_4);
                float2 scaled_floats_4 = __bfloat1622float2(scaled_bfloats_4);
                OUT_accum[iter_idx][6] += scaled_floats_4.x;
                OUT_accum[iter_idx][7] += scaled_floats_4.y;
            }
        }
    }

    #pragma unroll
    for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
        __nv_bfloat162 __align__(8) packed_bfloats_x8[4];
        packed_bfloats_x8[0] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][0], OUT_accum[iter_idx][1]
        );
        packed_bfloats_x8[1] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][2], OUT_accum[iter_idx][3]
        );
        packed_bfloats_x8[2] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][4], OUT_accum[iter_idx][5]
        );
        packed_bfloats_x8[3] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][6], OUT_accum[iter_idx][7]
        );

        *reinterpret_cast<uint4*>(OUT_d + iter_idx * STRIDE_8xWARP) =
            *reinterpret_cast<uint4*>(packed_bfloats_x8);
    }
}
Listing 2: Fused up and down projection leveraging gate projections in the TwELL format.

In Listing 2, we provide the device code for our kernel implementing the custom fused up and down projection that leverages the gate projections stored in the TwELL format. As explained in Section 3 of the main text, this kernel is launched on a grid of warp-sized CTAs and fuses the two operations by keeping in memory the dense input feature row and an accumulator. Then, iterating first statically through the TwELL tiles and then dynamically through the number of non-zeros in each tile, it loads the corresponding gate index, which directly maps to a unique column of the up projection and row of the down projection weight matrices. The kernel computes the up-projected feature from a dot product between the dense input feature row and the up projection weight column, multiplies it by the gate value, and finally uses it to scale the down projection weight row before accumulating the output. To ensure coalesced access, we note that the up projection weight matrix is stored in transposed format. This version of the kernel is specialized to handle the case where T_n = 256 and the compression ratio is 8, leading to a total of 32 elements for each packed TwELL tile. In this specific instance, we load both the number of non-zeros and all the indices and values for the tile in a single fully coalesced access over the CTA's warp, which later allows loading the full TwELL tile information via minimal warp register shuffle operations without incurring any shared memory overheads. In preliminary experiments, we also found that re-ordering the kernel calls in descending order of non-zeros can further accelerate performance at low batch sizes. However, we did not find this optimization necessary with large batches and omitted it for simplicity.
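As a plain C++ reference for the arithmetic the fused kernel performs (a readability sketch with illustrative names, not the CUDA implementation): for each non-zero gate entry j with value g_j, take the dot product of the input row with column j of the up projection, scale it by g_j, and accumulate that scalar times row j of the down projection:

```cpp
#include <cstddef>
#include <vector>

// Reference for the fused TwELL up/down projection of one token.
// x: input row (length d); up: d x h row-major (column j = up[:, j]);
// down: h x d row-major (row j); gate_idx/gate_val: non-zero gate entries.
std::vector<float> fused_up_down(const std::vector<float>& x,
                                 const std::vector<float>& up,
                                 const std::vector<float>& down,
                                 const std::vector<std::size_t>& gate_idx,
                                 const std::vector<float>& gate_val,
                                 std::size_t d, std::size_t h) {
    std::vector<float> out(d, 0.0f);
    for (std::size_t nz = 0; nz < gate_idx.size(); ++nz) {
        const std::size_t j = gate_idx[nz];
        float up_out = 0.0f;                       // x . up[:, j]
        for (std::size_t i = 0; i < d; ++i) up_out += x[i] * up[i * h + j];
        const float feat = gate_val[nz] * up_out;  // gated feature
        for (std::size_t i = 0; i < d; ++i) out[i] += feat * down[j * d + i];
    }
    return out;
}
```

The total work is proportional to the number of non-zero gates rather than the full hidden dimension, which is the source of the sparsity speedup.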
template<
    const int T_n,
    const int T_n_compressed,
    const int NUM_T_n,
    const int OUT_DIM,
    const int SPLIT_OUT_DIM
>
__global__ __launch_bounds__(32)
void mm_t2d_kernel(
    const uint32_t* IN_twell_packed_d,
    const __nv_bfloat16* DOWN_d,
    __nv_bfloat16* OUT_d
)
{
    static_assert(
        (SPLIT_OUT_DIM % WARP_SIZE) == 0,
        "SPLIT_OUT_DIM must be divisible by WARP_SIZE."
    );
    static_assert(
        (OUT_DIM % SPLIT_OUT_DIM) == 0,
        "OUT_DIM must be divisible by SPLIT_OUT_DIM."
    );
    static_assert(T_n_compressed == WARP_SIZE,
        "Warp-sync TwELL-to-dense assumes a 32-wide compressed tile.");

    float OUT_accum[OUT_DIM / STRIDE_8xWARP][8] = {0.0f};
    constexpr int NUM_LOAD_ITERS = SPLIT_OUT_DIM / STRIDE_8xWARP;

    IN_twell_packed_d += blockIdx.x * T_n_compressed * NUM_T_n + threadIdx.x;
    DOWN_d += threadIdx.x * 8 + blockIdx.y * SPLIT_OUT_DIM;
    OUT_d += blockIdx.x * OUT_DIM + threadIdx.x * 8 + blockIdx.y * SPLIT_OUT_DIM;

    #pragma unroll 1
    for (int tile_idx = 0; tile_idx < NUM_T_n; ++tile_idx) {
        const int lane_tile_register =
            IN_twell_packed_d[tile_idx * T_n_compressed];
        const int num_nonzeros =
            __shfl_sync(0xFFFFFFFFu, lane_tile_register, 0);

        #pragma unroll 1
        for (int idx = 1; idx < num_nonzeros + 1; ++idx) {
            const uint32_t compressed_idx_bf16 =
                __shfl_sync(0xFFFFFFFFu, lane_tile_register, idx);
            const uint32_t nonzero_idx = compressed_idx_bf16 & 0xFFFFu;
            const __nv_bfloat162 nonzero_feature =
                __bfloat162bfloat162(
                    reinterpret_cast<const __nv_bfloat16*>(
                        &compressed_idx_bf16
                    )[1]
                );

            #pragma unroll
            for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
                const uint4 packed_bfloats_x8 =
                    *reinterpret_cast<const uint4*>(
                        DOWN_d
                        + nonzero_idx * OUT_DIM
                        + iter_idx * STRIDE_8xWARP
                    );
                const __nv_bfloat162 packed_bfloats_1 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.x
                    );
                __nv_bfloat162 scaled_bfloats_1 =
                    __hmul2(nonzero_feature, packed_bfloats_1);
                float2 scaled_floats_1 = __bfloat1622float2(scaled_bfloats_1);
                OUT_accum[iter_idx][0] += scaled_floats_1.x;
                OUT_accum[iter_idx][1] += scaled_floats_1.y;

                const __nv_bfloat162 packed_bfloats_2 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.y
                    );
                __nv_bfloat162 scaled_bfloats_2 =
                    __hmul2(nonzero_feature, packed_bfloats_2);
                float2 scaled_floats_2 = __bfloat1622float2(scaled_bfloats_2);
                OUT_accum[iter_idx][2] += scaled_floats_2.x;
                OUT_accum[iter_idx][3] += scaled_floats_2.y;

                const __nv_bfloat162 packed_bfloats_3 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.z
                    );
                __nv_bfloat162 scaled_bfloats_3 =
                    __hmul2(nonzero_feature, packed_bfloats_3);
                float2 scaled_floats_3 = __bfloat1622float2(scaled_bfloats_3);
                OUT_accum[iter_idx][4] += scaled_floats_3.x;
                OUT_accum[iter_idx][5] += scaled_floats_3.y;

                const __nv_bfloat162 packed_bfloats_4 =
                    *reinterpret_cast<const __nv_bfloat162*>(
                        &packed_bfloats_x8.w
                    );
                __nv_bfloat162 scaled_bfloats_4 =
                    __hmul2(nonzero_feature, packed_bfloats_4);
                float2 scaled_floats_4 = __bfloat1622float2(scaled_bfloats_4);
                OUT_accum[iter_idx][6] += scaled_floats_4.x;
                OUT_accum[iter_idx][7] += scaled_floats_4.y;
            }
        }
    }

    #pragma unroll
    for (int iter_idx = 0; iter_idx < NUM_LOAD_ITERS; ++iter_idx) {
        __nv_bfloat162 __align__(8) packed_bfloats_x8[4];
        packed_bfloats_x8[0] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][0], OUT_accum[iter_idx][1]
        );
        packed_bfloats_x8[1] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][2], OUT_accum[iter_idx][3]
        );
        packed_bfloats_x8[2] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][4], OUT_accum[iter_idx][5]
        );
        packed_bfloats_x8[3] = __floats2bfloat162_rn(
            OUT_accum[iter_idx][6], OUT_accum[iter_idx][7]
        );

        *reinterpret_cast<uint4*>(OUT_d + iter_idx * STRIDE_8xWARP) =
            *reinterpret_cast<uint4*>(packed_bfloats_x8);
    }
}
Listing 3: Down projection leveraging up projections from the TwELL format for the non-gated model variants.

As mentioned in Section 2 of the main text, together with modern gated LLMs, we also provide specific kernels that support older non-gated variants, which we empirically evaluate in Appendix C. In Listing 3, we provide the device code for our kernel implementing the custom down projection that leverages up projection activations stored in the TwELL format for these experiments. Similarly to the fused kernel explained in Section 3 and examined above, this kernel is launched on a grid of warp-sized CTAs and reads the sparsity pattern, this time from the up projection activations stored in the TwELL format. Here, the kernel maintains in memory a float32 accumulator for a small segment of the output row. It first statically iterates through the TwELL tiles and then dynamically iterates over the number of non-zeros in each tile. At each iteration, it loads the non-zero index and activation value together with the corresponding down projection row segment, before multiplying the two and accumulating the result. In contrast to our fused kernel, where we must consider full rows of the input and output to perform dot products between the input features and the up projection weights, the split formulation in this kernel is a deliberate and purposeful choice: by introducing trivial duplication of the non-zero reads, we can further increase parallelism, reduce register pressure, increase occupancy, and hide the longer latencies caused by uneven sparsity. In practice, we find that using a split dimension of half the base output dimension, leading to two CTAs per output row, appears optimal on our Hopper GPUs.
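The split formulation amounts to partitioning the output columns into independent segments, each handled by its own CTA. A sequential sketch of the same idea (illustrative names; each call corresponds to the work of one CTA, so the per-non-zero reads are duplicated across segments exactly as described above):

```cpp
#include <cstddef>
#include <vector>

// Down projection from a sparse activation list over one output segment,
// as in the non-gated kernel: each (token, segment) pair is independent
// work, which is what lets extra CTAs hide uneven sparsity.
void sparse_down_segment(const std::vector<std::size_t>& idx,
                         const std::vector<float>& val,
                         const std::vector<float>& down,  // h x d row-major
                         std::size_t d, std::size_t seg_start,
                         std::size_t seg_len, std::vector<float>& out) {
    for (std::size_t nz = 0; nz < idx.size(); ++nz)
        for (std::size_t c = 0; c < seg_len; ++c)
            out[seg_start + c] += val[nz] * down[idx[nz] * d + seg_start + c];
}
```

Running the function once per segment reproduces the full-row result, mirroring the two-CTAs-per-row configuration found optimal on Hopper.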
A.2 Training Kernels Selection
__global__ void twell_to_ell_kernel(
    const __nv_bfloat16* __restrict__ C_vals,
    const uint8_t* __restrict__ C_idx,
    const uint32_t* __restrict__ C_nnz,
    __nv_bfloat16* __restrict__ ell_val,
    int16_t* __restrict__ ell_col,
    int32_t* __restrict__ row_nnz,
    float* __restrict__ l0_out,
    float* __restrict__ l1_out,
    int M,
    int N_TILES,
    int BW,
    int ELL_W,
    int T_n
)
{
    const int row = blockIdx.x * blockDim.y + threadIdx.y;
    if (row >= M) {
        return;
    }

    const int tid = threadIdx.x;
    int cnt = (tid < N_TILES) ? C_nnz[(size_t)tid * M + row] : 0;
    cnt = min(cnt, BW);

    int offset = cnt;
    for (int delta = 1; delta < WARP_SIZE; delta <<= 1) {
        const int recv = __shfl_up_sync(0xFFFFFFFFu, offset, delta);
        if (tid >= delta) {
            offset += recv;
        }
    }

    const int start = offset - cnt;
    const int total =
        __shfl_sync(0xFFFFFFFFu, offset, min(N_TILES - 1, WARP_SIZE - 1));

    const __nv_bfloat16* sv =
        C_vals + (size_t)row * N_TILES * BW + (size_t)tid * BW;
    const uint8_t* si =
        C_idx + (size_t)row * N_TILES * BW + (size_t)tid * BW;

    float l0_acc = 0.0f;
    float l1_acc = 0.0f;
    if (l0_out) {
        const float inv_M = 1.0f / (float)M;
        l0_acc = (float)cnt * inv_M;
        for (int i = 0; i < cnt; ++i) {
            l1_acc += __bfloat162float(sv[i]) * inv_M;
        }
    }

    if (cnt > 0 && start < ELL_W) {
        const int copy_n = min(cnt, ELL_W - start);
        __nv_bfloat16* dv = ell_val + (size_t)row * ELL_W + start;
        int16_t* dc = ell_col + (size_t)row * ELL_W + start;
        for (int i = 0; i < copy_n; ++i) {
            dv[i] = sv[i];
            dc[i] = (int16_t)(si[i]) + (int16_t)(tid * T_n);
        }
    }

    if (tid == 0) {
        row_nnz[row] = total;
    }

    if (l0_out) {
        for (int s = 16; s > 0; s >>= 1) {
            l0_acc += __shfl_down_sync(0xFFFFFFFFu, l0_acc, s);
            l1_acc += __shfl_down_sync(0xFFFFFFFFu, l1_acc, s);
        }
        if (tid == 0) {
            atomicAdd(l0_out, l0_acc);
            if (l1_out) {
                atomicAdd(l1_out, l1_acc);
            }
        }
    }
}
Listing 4: Conversion from TwELL to the hybrid format.

In Listing 4, we provide the device code for our training kernel used to convert gate activations stored in the TwELL format into the compact ELL component of our hybrid training representation, while accumulating $L_0$ and $L_1$ statistics. As discussed in Section 3, the conversion dynamically partitions the rows based on their non-zero counts. We allocate a warp to each row, and let each thread read the number of active entries in a single tile. We then use warp register shuffles to obtain an inclusive prefix scan and determine the starting offset of that tile within the destination ELL row. This design allows for directly compacting the tiled representation into contiguous row-wise ELL storage without requiring any synchronization beyond warp level or any shared memory accesses. The kernel writes the true row occupancy to `row_nnz` even when the row exceeds the configured ELL width `ELL_W`, allowing overflow rows to be detected and promoted to the dense tail of the hybrid format. During training, each warp also reduces simple $L_0$ and $L_1$ sparsity statistics to compute the sparsity levels and the L1 loss before issuing a single atomic update.
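The prefix-scan compaction in Listing 4 can be sketched sequentially on the CPU; the following is a reference under assumed names (`EllRow`, `compact_row`), not the paper's code. The running offset plays the role of the warp-shuffle inclusive scan, and the true occupancy is recorded even when it exceeds the ELL width, so the row can later be promoted to the dense tail.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Compacted ELL row produced from a TwELL-like tiled row.
struct EllRow {
    std::vector<float> vals;
    std::vector<int16_t> cols;
    int nnz = 0;  // true occupancy; may exceed vals.size() (overflow row)
};

// tile_nnz[t] non-zeros per tile, with in-tile indices and values;
// t_n is the tile width, ell_w the fixed ELL row capacity.
EllRow compact_row(const std::vector<int>& tile_nnz,
                   const std::vector<std::vector<uint8_t>>& tile_idx,
                   const std::vector<std::vector<float>>& tile_val,
                   int t_n, int ell_w) {
    EllRow out;
    out.vals.assign(ell_w, 0.0f);
    out.cols.assign(ell_w, 0);
    int offset = 0;  // exclusive prefix sum (warp scan in the kernel)
    for (size_t t = 0; t < tile_nnz.size(); ++t) {
        int start = offset;
        offset += tile_nnz[t];
        // clamp the copy so overflowing entries are dropped from ELL storage
        int copy_n = std::min(tile_nnz[t], std::max(0, ell_w - start));
        for (int i = 0; i < copy_n; ++i) {
            out.vals[start + i] = tile_val[t][i];
            // global column = in-tile index + tile offset
            out.cols[start + i] = (int16_t)(tile_idx[t][i] + (int)t * t_n);
        }
    }
    out.nnz = offset;  // written even on overflow, enabling promotion
    return out;
}
```

Rows whose `nnz` exceeds `ell_w` are exactly the rows the training pipeline routes to the dense backup matrix.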
__global__ void matmul_save_sparse_like_ell(
    bfloat16* A,
    bfloat16* B_T,
    ELL* out,
    int M,
    int K,
    int N
)
{
    const int row = blockIdx.x;
    const int ell_n = out->row_counts[row];
    if (ell_n == 0 || ell_n > ELL_WIDTH) {
        return;
    }

    bfloat16* A_row_ptr = A + row * K;
    const int lane_id = threadIdx.x & 31;
    const int warp_id = threadIdx.x >> 5;
    const int num_warps = blockDim.x >> 5;
    const int num_chunks = K / 8;

    for (int out_idx = warp_id; out_idx < ell_n; out_idx += num_warps) {
        const int col = out->cols[row * ELL_WIDTH + out_idx];
        bfloat16* B_row_ptr = B_T + col * K;
        float acc = 0.0f;

        for (int chunk_base = 0; chunk_base < num_chunks; chunk_base += 32) {
            const int chunk = chunk_base + lane_id;
            if (chunk >= num_chunks) {
                break;
            }

            int4 a_raw = *(int4*)(A_row_ptr + chunk * 8);
            int4 b_raw = *(int4*)(B_row_ptr + chunk * 8);
            bfloat16_2* a_vec = (bfloat16_2*)&a_raw;
            bfloat16_2* b_vec = (bfloat16_2*)&b_raw;

            for (int t = 0; t < 4; ++t) {
                float2 af = bfloat1622float2(a_vec[t]);
                float2 bf = bfloat1622float2(b_vec[t]);
                acc = fmaf(af.x, bf.x, acc);
                acc = fmaf(af.y, bf.y, acc);
            }
        }

        for (int offset = 16; offset > 0; offset >>= 1) {
            acc += __shfl_xor_sync(0xFFFFFFFFu, acc, offset);
        }
        if (lane_id == 0) {
            out->vals[row * ELL_WIDTH + out_idx] = float2bfloat16(acc);
        }
    }
}
Listing 5: Dense-to-hybrid matmul for populating the sparse ELL component during training using CUDA cores.

In Listing 5, we provide the code of our custom kernel used to perform the efficient dense-to-hybrid matmuls used during training, focusing on the sparse component. This kernel shows the logic of the dynamic hybrid partitioning, skipping the sparse operation when the number of non-zeros is recognized to exceed the size of the aggressively compact ELL format. The kernel takes two dense matrices, $A$ and $B$ (provided as $B^T$), and a pre-allocated ELL output `out` of shape $M \times N$, whose column indices encode the sparsity pattern to be evaluated. Rather than computing all $MN$ outputs, each thread block processes a single output row and iterates only over the column indices stored for that row in `out`. For each selected column, the kernel computes the dot product between $A[\text{row}, :]$ and $B^T[\text{col}, :]$. To maximize coalescing and enable vectorized memory accesses, $B$ is stored transposed so that the rows of $B^T$ are contiguous in memory. To maximize throughput with bfloat16, threads load $A$ and $B^T$ in 128-bit transactions (8 bfloat16 values at a time) and accumulate in float32 using fused multiply-adds. Each warp reduces partial sums using a shuffle-based reduction, and the final value is written to the corresponding slot in the ELL value array. This design aligns with ELL's row-oriented storage: the sparsity pattern is known up front, so the kernel avoids both dense materialization and irregular gathers beyond the indexed rows of $B^T$. In the forward pass, we use this kernel to compute only the entries of the up projection operation $xW_u$ that will survive the subsequent gating, by copying the sparsity pattern from $h_g$ into `out` and filling its values with the corresponding dot products. In the backward pass, the same kernel is reused for masked gradient matmuls that share a known sparsity pattern.
For instance, we use it to compute $\nabla h = \nabla y\, W_d^T$. Rows that exceed the ELL capacity are handled by routing the overflow to the dense backup matrix and computing that portion with Tensor Core-optimized kernels, as described in Algorithm 3, multiplied by a binary mask containing the sparsity pattern to be applied.
__global__ void hybrid_to_dense_matmul(
    ELL* A,
    bfloat16* B,
    bfloat16* C,
    int M,
    int N,
    int K
)
{
    const int row = blockIdx.x;
    const int ell_n = A->row_counts[row];
    if (ell_n == 0 || ell_n > ELL_WIDTH) {
        return;
    }

    bfloat16* A_row_vals = A->vals + row * ELL_WIDTH;
    uint16_t* A_row_idxs = A->cols + row * ELL_WIDTH;

    for (int n_out = threadIdx.x * 8; n_out < N; n_out += 8 * blockDim.x) {
        float2 acc[4];
        for (int i = 0; i < 4; ++i) {
            acc[i] = make_float2(0.f, 0.f);
        }

        for (int k = 0; k < ELL_WIDTH; ++k) {
            if (k >= ell_n) {
                break;
            }

            const bfloat16 a_val = A_row_vals[k];
            const uint16_t col_idx = A_row_idxs[k];
            bfloat16* B_row_ptr = B + col_idx * N + n_out;
            int4 b_vec_raw = *(int4*)(B_row_ptr);
            bfloat162* b_vec = (bfloat162*)(&b_vec_raw);
            const float a = bfloat162float(a_val);

            for (int t = 0; t < 4; ++t) {
                float2 b_f32 = bfloat1622float2(b_vec[t]);
                acc[t].x = fmaf(a, b_f32.x, acc[t].x);
                acc[t].y = fmaf(a, b_f32.y, acc[t].y);
            }
        }

        bfloat162* C_ptr = (bfloat162*)(C + row * N + n_out);
        for (int i = 0; i < 4; ++i) {
            C_ptr[i] = float22bfloat162(acc[i]);
        }
    }
}
Listing 6: Hybrid-to-dense sparse matmul using the ELL component during training on CUDA cores.

In Listing 6, we provide the code of our custom kernel used to perform the efficient hybrid-to-dense matmuls used during training, focusing on the sparse component. Again, this kernel implements the same dynamic hybrid partitioning logic, skipping the sparse operation when the number of non-zeros is recognized to exceed the size of the aggressively compact ELL format. In particular, the kernel computes a sparse-dense matrix multiplication $C = AB$, where $A$ is stored in ELL format and $B$ and $C$ are dense row-major matrices. The kernel maps one CTA per output row of $C$, which aligns naturally with ELL's row-wise storage and lets the CTA reuse the same sparse row metadata while sweeping across the output columns. Within a CTA, threads partition the output row into contiguous column tiles. For each tile, they iterate over the non-zeros in the corresponding ELL row of $A$ and accumulate contributions of the form $a \cdot B[\text{col\_idx}, :]$ into $C[\text{row}, :]$. To maximize memory throughput for bfloat16, the kernel accesses $B$ using 128-bit SIMD loads, so that each thread processes 8 output elements at a time. Accumulation is performed in float32, and the results are written back in vectorized form, providing a simple and efficient SpMM for the fixed-width ELL layout. Rows that exceed the ELL capacity are handled by routing the overflow to the dense backup matrix and computing that portion with Tensor Core-optimized kernels, as described in Algorithm 3. This kernel is used in the forward pass to compute the feedforward layer output. In the backward pass, it is also used to compute gradients with respect to the layer parameters as well as the input activations.
__global__ void hybrid_transpose(
    ELL* A,
    ELL* A_T,
    bfloat16* tail_A,
    bfloat16* tail_A_T,
    int M,
    int N
)
{
    for (int row = blockIdx.x; row < M; row += gridDim.x) {
        const int nnz_row = A->row_counts[row];
        if (nnz_row == 0 || nnz_row > ELL_WIDTH) {
            continue;
        }

        for (int k = threadIdx.x; k < nnz_row; k += blockDim.x) {
            const uint16_t col = A->cols[row * ELL_WIDTH + k];
            const bfloat16 val = A->vals[row * ELL_WIDTH + k];
            const int out_row = col;
            const int out_col = row;
            const int pos = atomicAdd(&A_T->row_counts[out_row], 1);

            if (pos < ELL_WIDTH) {
                const size_t addr = out_row * ELL_WIDTH + pos;
                A_T->cols[addr] = out_col;
                A_T->vals[addr] = val;
            } else {
                const int d_row =
                    get_or_allocate_dense_row(out_row, A_T->tail_map);
                tail_A_T[d_row * M + out_col] = val;
            }
        }
    }

    for (int d_row = blockIdx.x; d_row < A->tail_rows; d_row += gridDim.x) {
        const int src_row = A->tail_map_reverse[d_row];
        bfloat16* src = tail_A + d_row * N;

        for (int col0 = threadIdx.x * 8; col0 < N; col0 += blockDim.x * 8) {
            int4 raw = *(int4*)(src + col0);
            if ((raw.x | raw.y | raw.z | raw.w) == 0) {
                continue;
            }

            for (int i = 0; i < 8; ++i) {
                const bfloat16 val = unpack_element(&raw, i);
                if (val == 0.0f) {
                    continue;
                }

                const int out_row = col0 + i;
                const int out_col = src_row;
                const int pos = atomicAdd(&A_T->row_counts[out_row], 1);

                if (pos < ELL_WIDTH) {
                    const size_t addr = out_row * ELL_WIDTH + pos;
                    A_T->cols[addr] = out_col;
                    A_T->vals[addr] = val;
                } else {
                    const int dense_row =
                        get_or_allocate_dense_row(out_row, A_T->tail_map);
                    tail_A_T[dense_row * M + out_col] = val;
                }
            }
        }
    }
}
Listing 7: Transposition of the hybrid sparse format used during training.

In Listing 7, we provide the code of our custom kernel used to perform efficient transposition of a matrix stored in our hybrid training format. The kernel takes as input a matrix $A$ and produces $A^T$ in the same representation: an ELL matrix, plus a dense backup for rows that overflow the maximum number of non-zeros. It operates in two parts. First, it transposes the non-overflow rows stored in the ELL structure by iterating over each row's non-zeros and inserting them into the corresponding row of $A^T$ (since a non-zero at $(\text{row}, \text{col})$ becomes an entry in row $\text{col}$ of the transpose). Because many source rows may map to the same destination row, the kernel uses atomic increments to reserve an insertion slot. If the destination row still has capacity, the entry is written into the ELL arrays of $A^T$; otherwise, it is routed to the dense backup, using a per-row mapping that allocates a dense-tail row on demand. Second, it handles the overflow rows that are materialized in the dense tail of $A$. These rows are scanned in vectorized chunks (128-bit loads, i.e., 8 bfloat16 values at a time) with a fast zero check to skip all-zero vectors. Only non-zero elements are emitted into $A^T$ using the same hybrid partitioning logic. This approach keeps transposition efficient while preserving the hybrid format and avoiding expensive conversions to more general sparse layouts. After this kernel completes, we launch a small helper kernel to copy the entries stored in the ELL arrays for rows that overflowed into the corresponding dense-backup rows. We note that the necessity of this final small step comes from the fact that dense rows are only allocated and populated after the ELL slots for a given output row have been exhausted.
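The slot-reservation and spill logic can be condensed into a CPU sketch; the names below (`HybridT`, `hybrid_transpose_ref`) are hypothetical, and a `std::map` stands in for the kernel's on-demand dense-row allocator.

```cpp
#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Transposed hybrid matrix: per-row ELL slots plus dense rows on demand.
struct HybridT {
    std::vector<std::vector<std::pair<int, float>>> ell;  // per-row (col, val)
    std::map<int, std::vector<float>> tail;               // dense overflow rows
};

// A is given as per-row (col, val) lists (M rows); ell_w is the ELL capacity
// of each destination row. Entry (row, col) lands in output row `col`;
// once that row's ELL slots are exhausted, entries spill to a dense tail.
HybridT hybrid_transpose_ref(
        const std::vector<std::vector<std::pair<int, float>>>& A,
        int M, int ell_w) {
    int N = 0;
    for (const auto& row : A)
        for (const auto& e : row) N = std::max(N, e.first + 1);
    HybridT T;
    T.ell.resize(N);
    for (int row = 0; row < M; ++row) {
        for (const auto& [col, val] : A[row]) {
            if ((int)T.ell[col].size() < ell_w) {
                T.ell[col].push_back({row, val});  // reserve an ELL slot
            } else {
                auto& d = T.tail[col];             // allocate dense row on demand
                if (d.empty()) d.assign(M, 0.0f);
                d[row] = val;                      // spill to the dense tail
            }
        }
    }
    return T;
}
```

The sequential loop makes the ordering deterministic; on the GPU, the atomic slot counter only guarantees that each entry lands in a unique slot, not in which order.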
Appendix B Hyperparameters and Datasets
B.1 Training Details
Table 2: Default Hyperparameters for Pretraining Sparse and Non-Sparse LLMs.

As explained in Section 4 of the main text, our sparse models and dense baselines implement a "Transformer++" architecture with gated feedforward blocks, as is common in recent LLMs such as Qwen and Llama [llama2,qwen2]. Moreover, in Appendix C, we also collect results on a non-gated variant of the same architecture, more similar to the original transformer design [vaswani2017attention]. We train all models on the fineweb dataset [fineweb]. In particular, we consider a deduplicated version of the fineweb-edu split, obtained from an open corpus used to pretrain the SmolLM family of models [smollm]. We note that all our models are trained with the chinchilla-optimal number of tokens [chinchilla]: around 10B tokens for our 0.5B models, 20B tokens for our 1B models, 30B tokens for our 1.5B models, and 40B tokens for our 2B models.
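The token budgets quoted above follow the common ~20 tokens-per-parameter reading of the chinchilla scaling law; a trivial arithmetic check of the listed figures (a sketch, not code from the paper):

```cpp
// Chinchilla-optimal token budget under the ~20 tokens/parameter rule of
// thumb; both quantities are in billions.
double chinchilla_tokens_B(double params_B) {
    return 20.0 * params_B;
}
```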
We provide a full list of hyperparameters and training specifications for our training settings and models in Table 2. For all models, we use context lengths of 2048 tokens with batches of 512 sequences, resulting in a global batch size of about 1M tokens. For our gated variant, we use a model dimensionality of 2048 and a hidden dimension of 5632 in the feedforward blocks, roughly eight-thirds of the hidden size. The main difference with the non-gated variant is that we use a much larger intermediate size of 8192, four times the hidden size, leading to the same total number of parameters. We note that both these choices are considered optimal in the current literature on model design. When varying model sizes, we modify the number of layers to achieve the target parameter counts. In practice, modern small models have also considered shifting even more of the parameters and FLOPs to the feedforward blocks: for instance, the 1B model of the Llama 3 family has a feedforward size of 4x the hidden size even while implementing the gated design [llama3]. While the gains from our sparse kernels could be even greater in these settings, we opted for a more conservative design to avoid biasing our results toward smaller models.
To optimize our models, we use the AdamW optimizer [adamw] with a weight decay of 0.1 and a cosine learning rate schedule starting from a peak learning rate of $1.0\times 10^{-3}$, after a small warmup of 600 steps. We use the default Adam parameters of $(\beta_1, \beta_2, \epsilon) = (0.9, 0.95, 1\times 10^{-8})$ and clip gradients at a maximum norm of 1.0. Our token vocabulary comes from a GPT2 tokenizer [gpt2]. We train using standard mixed precision with the bfloat16 format, with our optimizer states stored in full precision.
B.2 Task Evaluation Details
We evaluate our models using cloze-formulation scores on seven standard downstream multiple-choice benchmarks that probe logical and commonsense reasoning after pretraining. In particular, we consider ARC (Easy and Challenge versions) [bench_1_arc], a grade-school science question answering benchmark whose Challenge subset is designed to defeat simple retrieval- and co-occurrence-based baselines; HellaSwag [bench_2_hellaswag], a commonsense sentence-completion task designed to be adversarially challenging for LLMs; OpenBookQA [bench_3_openbook_qa], focused on probing curated sets of science-based and commonsense knowledge; PIQA [bench_4_piqa], a benchmark focused on physical commonsense reasoning; WinoGrande [bench_5_winogrande], a large-scale Winograd-style coreference benchmark; and CommonsenseQA [bench_6_commonsenseqa], evaluating broader commonsense reasoning. We follow standard evaluation protocols and hyperparameters for formatting the input questions.
B.2.1 Sparse Data Structures Sizing
We note that the hybrid training format proposed in this paper introduces two core hyperparameters necessary to fulfill its targeted static allocation design: the maximum number of ELL elements per row, and the number of rows in the dense matrix that stores overflowing elements. Both hyperparameters effectively control a trade-off between performance and memory savings, making their values partially dependent on the sparsity level. Moreover, because sparsity can change abruptly during training, we evaluate a set of sizes that can tolerate sudden decreases in sparsity while remaining performant. In practice, we find that setting the maximum number of elements to 128, and the maximum number of backup rows to one-eighth of the token batch size, is a robust choice for all sparsity levels above $1.5\times 10^{-5}$. Below this point, simply doubling the ELL non-zeros prevents other instabilities. However, we note that with prior knowledge of the sparsity evolution, these structures can often be made smaller within training itself. For example, for $L_1 = 1\times 10^{-4}$, we observe that after training stabilizes, we can reduce the number of rows in the dense overflow matrix to 512, enabling higher speedups and additional memory savings. Moreover, the requirements on these two limits differ between the forward and backward passes due to the sparse-matrix transposition used in backpropagation. We note that relevant future extensions could characterize these requirements and develop online tuning of these hyperparameters to improve performance and memory savings further. Finally, when the number of elements exceeds the capacity of our data structures, we currently discard the excess values to avoid a hard failure and set a flag that is reported to the CPU at the next GPU synchronization point. This allows the training system to adaptively increase the structure sizes and repeat the latest optimization step, avoiding any deterioration in the learning dynamics.
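Under the sizing rule above (ELL width of 128, dense tail of one-eighth of the token batch), the static allocation can be compared against a dense activation buffer with simple arithmetic. The byte counts below assume bfloat16 values and int16 column indices and are purely illustrative, not measurements from the paper:

```cpp
#include <cstddef>

// Static-allocation footprint of the hybrid format for a batch of `rows`
// token activations with `hidden` features: a fixed-width ELL part plus
// a dense tail of rows/8 backup rows. Illustrative sketch.
size_t hybrid_bytes(size_t rows, size_t hidden, size_t ell_w) {
    size_t ell = rows * ell_w * (2 + 2);    // bf16 value + int16 column index
    size_t tail = (rows / 8) * hidden * 2;  // dense bf16 backup rows
    return ell + tail;
}

// Footprint of the corresponding fully dense bf16 activation buffer.
size_t dense_bytes(size_t rows, size_t hidden) {
    return rows * hidden * 2;
}
```

For typical feedforward widths well above 512, the hybrid footprint stays a small fraction of the dense one, which is where the quoted memory savings come from.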
Appendix C Parameter Studies and Ablations
C.1 Performance and Efficiency Across Activation Functions
Table 3: Comparison of performance and efficiency statistics of sparse LLMs leveraging our kernels with traditional gated models using both ReLU and SiLU activations [SiLU_gelu,SiLU_swish,shazeer2020glu].

As noted in Section 2, many recent LLM architectures have deviated from using ReLUs in favor of smoother activation functions such as SiLU [SiLU_gelu,SiLU_swish]. To provide a direct comparison between the two activations, we perform additional training runs on 30B tokens with our 1.5B model and collect efficiency and performance results. In Table 3, we find that, while the final cross-entropy appears equivalent, SiLU activations indeed yield slightly higher task accuracy on our evaluation set. However, we note that SiLU LLMs are already marginally slower than non-sparse ReLU LLMs by 0.5%, and due to their inherently non-sparse nature, they cannot support integration with sparsity and, therefore, cannot benefit from our kernels. Overall, we find these results consistent with the ones from mirzadeh2023relu_apple_finetune using larger OPT models [opt], appearing to indicate that the advantages of smooth activation functions are minor and could potentially be offset by efficiency considerations.
C.2 Non-gated Sparse LLMs
Table 4: Comparison of performance and efficiency statistics of sparse LLMs leveraging our kernels with traditional baselines, considering both gated models [shazeer2020glu] and their non-gated counterparts used in the original transformer [vaswani2017attention].

As explained in Section 2, from the simple 2-layer feedforward block used in the original transformer, there has been a notable shift, with modern models adopting a gated variant due to small but consistent empirical gains. Nonetheless, in our work, we introduce training and inference kernels for both variants. In contrast to the gated variant, which computes the output activations following Equation 1, the computation for the non-gated variant simplifies to:
$$h = \phi(xW_u), \qquad y = hW_d, \tag{5}$$

where $\phi$ is, once again, the non-linear activation function. Thus, when $\phi$ is a ReLU activation, the sparsity pattern is determined by the up projection rather than the gate projection. For the inference kernels, this implies that the non-gated version performs the up projection, rather than the gate projection, with our TwELL-storage matrix multiplication kernel introduced in Section 3. Moreover, as detailed in Appendix A, we designed an additional kernel optimized to perform the down projection alone starting from the TwELL format.
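Equation (5) with a ReLU activation can be sketched directly; the skip over inactive neurons in the second loop is precisely the work our kernels eliminate on the GPU (a plain C++ reference with illustrative names, not the paper's implementation):

```cpp
#include <algorithm>
#include <vector>

// Non-gated feedforward block of Equation (5): h = ReLU(x W_u), y = h W_d.
// x has d features, the hidden layer has f neurons; W_u is [d x f] and
// W_d is [f x d], both row-major.
std::vector<float> ffn_nongated(const std::vector<float>& x,
                                const std::vector<float>& W_u,
                                const std::vector<float>& W_d,
                                int d, int f) {
    std::vector<float> h(f, 0.0f), y(d, 0.0f);
    for (int j = 0; j < f; ++j) {
        float s = 0.0f;
        for (int i = 0; i < d; ++i) s += x[i] * W_u[i * f + j];
        h[j] = std::max(0.0f, s);  // ReLU zeros define the sparsity pattern
    }
    for (int j = 0; j < f; ++j) {
        if (h[j] == 0.0f) continue;  // skip inactive neurons entirely
        for (int i = 0; i < d; ++i) y[i] += h[j] * W_d[j * d + i];
    }
    return y;
}
```

With over 95% of neurons silent per token, the second loop touches only a small slice of $W_d$, which is the theoretical saving the TwELL kernels convert into wall-clock speedups.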
Thus, to provide a direct comparison between the two variants and the relative effects of sparsity and our custom kernels, we collect additional training runs on 30B tokens with our 1.5B model implementing the non-gated parameterization. In particular, we consider two sparsity levels in addition to a non-sparse baseline ($L_1 = 0$): our recommended conservative regularization of $L_1 = 2\times 10^{-5}$ and a more aggressive regularization of $L_1 = 3\times 10^{-5}$. In Table 4, we report the collected relative performance and efficiency results for both variants and all three sparsity levels. As shown, we find only minor performance differences between the two variants, which are likely not significant and attributable to random variation. However, we do note that such differences might become visible only when training with token budgets beyond chinchilla optimality [chinchilla]. Efficiency-wise, while both variants benefit significantly from our sparse kernels, we find such benefits to be larger for the gated variant. The inference speedup of the non-gated variant is 11.2% at $L_1 = 2\times 10^{-5}$, compared to 17.9% for the gated variant at the same sparsity level. At larger sparsity levels, this divide is more pronounced, with the gated variant achieving a 25.5% speedup at $L_1 = 3\times 10^{-5}$ compared to only 13.1% for the non-gated variant. These results follow intuitively from the nature of the two models, as the gated variant allows our new inference kernels to exploit the efficient fusion of both the up and down projections. Nonetheless, they also demonstrate that the benefits of our kernels extend beyond a single architectural choice.
C.3 Strategies for Dead Neuron Mitigation
Table 5: Comparison of performance and efficiency statistics of sparse LLMs leveraging our kernels with traditional baselines trained using our standard recipe, or with dead neuron mitigation strategies such as warming up the coefficient of the L1 loss and applying targeted reinitialization to the gate projection's weights.

Figure 8: Number of non-zeros and fraction of dead neurons of LLMs with different strategies for dead neuron mitigation throughout training.

While we find that using an L1 coefficient of $L_1 = 2\times 10^{-5}$ provides a relevant boost in efficiency without any noticeable downstream performance degradation, we explore preliminary directions to mitigate the potential downsides of sparse training. In particular, as detailed in Appendix D, when examining the number of active neurons throughout training, we see that over 30% of the neurons become permanently inactive on average across layers when using our recommended L1 coefficient, with this fraction rising considerably for higher regularizations. While at our recommended coefficient this symptom does not seem to visibly affect downstream performance, reducing it could potentially allow supporting even higher sparsity before incurring performance degradation.
Based on these considerations, we explore two preliminary extensions to the simple L1-regularized training recipe explained in Section 2. First, we consider simply scheduling the L1 regularization, motivated by our finding that dead neurons appear to arise very early during training. Concretely, we first train our models for 5000 steps without any L1 regularization, followed by a further 5000 steps of linearly increasing the L1 coefficient. This makes the training setting artificially similar to prior work focusing on finetuning and continued pretraining [song2024prosparse,q_sparse_top_k_related]. Second, we consider a targeted reinitialization strategy that lowers the magnitude of, and reinjects random noise into, only the columns of the gate projection that produce always-negative outputs (which, after the ReLU, lead to dead neurons). Given the model's initialization standard deviation $\sigma = 0.02$, we noise and rescale these weights to regress them toward their initial state, essentially interpolating with a coefficient $\lambda$:
$$W_g[:, j] \leftarrow (1-\lambda)\, W_g[:, j] + \lambda\, \mathcal{N}(0, \sigma^2). \tag{6}$$

We apply this targeted reinitialization after every training step, which we find does not significantly affect training time. In preliminary experiments, we found $\lambda = 0.1$ to be a good choice that avoids affecting training dynamics while injecting sufficient noise to revive dead neurons. We note this strategy is similar to older techniques for reinjecting plasticity into architectures in continual learning and other non-stationary settings [warmstarting_nn].
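Equation (6) amounts to the following per-column update, shown here as a sketch with assumed names (`reinit_dead_columns`); in the paper the update is applied after every training step to the gate columns whose pre-activations are always negative.

```cpp
#include <random>
#include <vector>

// Targeted reinitialization of Equation (6): dead gate columns j are
// interpolated toward a fresh N(0, sigma^2) draw with coefficient lambda,
// shrinking their magnitude while injecting noise at the init scale.
// W_g is [d x f], row-major. Illustrative sketch, not the paper's code.
void reinit_dead_columns(std::vector<float>& W_g,
                         const std::vector<int>& dead,  // dead column ids
                         int d, int f,
                         float lambda, float sigma,
                         std::mt19937& rng) {
    std::normal_distribution<float> noise(0.0f, sigma);
    for (int j : dead)
        for (int i = 0; i < d; ++i)
            W_g[i * f + j] = (1.0f - lambda) * W_g[i * f + j]
                           + lambda * noise(rng);
}
```

With $\lambda = 0.1$ and $\sigma = 0.02$, each application shrinks a dead column by only 10% while adding noise with standard deviation 0.002, so live training dynamics are barely perturbed.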
In Table 5, we report the performance and efficiency results of our two strategies compared to our standard recipe and the non-sparse baseline, while in Figure 8 we analyze the number of non-zero activations and dead neurons throughout training. Looking at the dead neuron statistics, we find that both strategies almost entirely mitigate the emergence of dead neurons. However, we immediately see a concerning pattern with the sparsity-warmup strategy, as its number of non-zeros considerably increases throughout training. In particular, the considered coefficient of $L_1 = 3\times 10^{-4}$, which is over ten times larger than our recommended value, leads to over 100 non-zeros on average across layers at the end of training, compared to only 29 non-zeros when using our standard recipe with $L_1 = 2\times 10^{-5}$. We note that, in early experiments, we found that increasing the L1 coefficient further led to training instabilities and loss spikes. In contrast, with the targeted dead neuron reinitialization, we find non-zero statistics similar to our standard recipe while still effectively mitigating dead neurons. Furthermore, as reported in Table 5, we find that this latter strategy provides a small boost in both downstream performance and efficiency, processing tokens 19.1% faster than the non-sparse baseline with our default L1 coefficient of $L_1 = 2\times 10^{-5}$. We believe these preliminary results suggest that further research into optimal sparse training could further increase the relevance and efficiency upsides of sparse LLMs.
Appendix D Extended Results
D.1 Sparsity and Dead Neurons During Training


Figure 9: Number of non-zeros and fraction of dead neurons of LLMs across L1 regularization levels throughout training.

In Figure 9, we provide detailed results on how activation sparsity and dead neuron occurrence evolve during training for all our different L1 regularization levels. In particular, we record dead neurons at the end of each training step by keeping track, for each hidden feedforward activation of each layer, of the last time it was non-zero. If a neuron was never active for a whole training step (just above 1M tokens), we consider it dead for that step.
We make two immediate observations from these results. First, we find that the sparsity levels settle to low values early on, after only around 1000 training steps (around 1B tokens). Thanks to this property, the throughput and memory advantages of our training kernels become relevant almost from the inception of our training runs. Second, we observe that the same trend applies to the number of dead neurons: our recommended $L_1 = 2\times 10^{-5}$ already exceeds 30% inactivity, which further monotonically increases with higher regularization levels. While at our recommended coefficient this symptom does not seem to visibly affect downstream performance, reducing it could potentially allow supporting even higher sparsity before incurring performance degradation. To this end, we note that in Appendix C we provide preliminary results indicating that the performance of sparse LLMs could be further improved with strategies targeted at dead-neuron mitigation.
D.2 Task Performance Details
Table 6: Granular comparison of per-task downstream performance across model scales to complement Table 1.

To complement the results in Section 4 of the main text, we provide detailed granular results of downstream task performance across the seven tasks considered, targeting logic and reasoning capabilities after pretraining [bench_1_arc,bench_2_hellaswag,bench_3_openbook_qa,bench_4_piqa,bench_5_winogrande,bench_6_commonsenseqa]. In particular, we report the per-task accuracies for both sparse models, using our recommended conservative L1 regularization of $2\times 10^{-5}$, and their non-sparse counterparts across all the examined model scales. As shown in Table 6, and consistently with our main text analysis, we do not find significant performance differences between sparse and non-sparse models for our regularization level and all considered tasks. We do, indeed, observe an expected performance rise with larger models across the great majority of tasks.
D.3 Activation Sparsity at High and Low Levels
Figure 10: Sparsity statistics and speedup contributions across different layers of non-sparse LLMs.
Figure 11: Sparsity statistics and speedup contributions across different layers of an LLM with high regularization $L_1 = 1\times 10^{-4}$.
Figure 12: Training speedups from our sparse LLM training kernels across L1 regularization levels for both H100 and RTX 6000 devices.

To complement the analysis provided in Section 4 of the main text, we examine how sparsity regularization affects the distribution of non-zero activations across model depth and relate these metrics to the corresponding speedup contributions from our kernels during inference. While in our main analysis we reported and analyzed the LLM trained with our recommended conservative L1 regularization of $2\times 10^{-5}$, in Figure 10 we provide analogous results for a non-sparse LLM, while in Figure 11 we analyze an LLM trained with the highest regularization level considered ($1\times 10^{-4}$). We note that for non-sparse models, due to the high number of non-zeros, the contributions of applying our kernels are actually detrimental, and as such, we report the speedup contributions as negative percentages. A first observation from the sparsity statistics is that the average number of non-zeros also follows a noticeable trend in the non-sparse model, with the first few layers being the least active, followed by a hump with a peak in activations. However, a key difference comes with the location of the hump: while in our recommended sparse model the peak occurs around layer 6, in the non-sparse LLM the peak still occurs within the first half of the network but is shifted visibly deeper into the architecture, around layer 13. Interestingly, in the high-regularization LLM, we observe that while the very first layer is again the least active, there are two different peaks: one very early, around the second layer, and another in the last layer of the model. Once again, we find that maximum activation counts can easily be well over an order of magnitude higher than the average, with no clear pattern across layers.
For the non-sparse model, we again observe a strong inverse correlation between each layer's average non-zero count and its relative speed-up. For the high-regularization LLM, in contrast, this correlation is much less visible: given the high sparsity encountered, the speedups of our kernels are already at their achievable maximum for almost all layers, essentially making the execution of the up and down projections a negligible fraction of the overall computation time.
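The per-layer statistics discussed above (average and maximum non-zero activation counts per token) can be computed with a few lines of NumPy. This is an illustrative sketch, not the authors' instrumentation code; `nonzero_stats` and the toy shapes are assumptions for the example.

```python
import numpy as np

def nonzero_stats(activations):
    """Per-layer non-zero statistics for post-activation MLP outputs.

    `activations` maps layer index -> array of shape (num_tokens, d_ff)
    holding post-ReLU values. Returns the (average, maximum) non-zero
    count per token for each layer, the two quantities analyzed above.
    """
    stats = {}
    for layer, acts in activations.items():
        nnz_per_token = np.count_nonzero(acts, axis=1)
        stats[layer] = (nnz_per_token.mean(), nnz_per_token.max())
    return stats

# Toy example: two layers, 4 tokens, d_ff = 8 (illustrative shapes only).
rng = np.random.default_rng(0)
acts = {0: np.maximum(rng.normal(size=(4, 8)) - 1.0, 0.0),
        1: np.maximum(rng.normal(size=(4, 8)), 0.0)}
for layer, (avg, mx) in nonzero_stats(acts).items():
    print(f"layer {layer}: avg nnz/token = {avg:.1f}, max = {mx}")
```

The gap between the average and the maximum per layer is exactly what motivates the dense backup path in the Hybrid format: the average drives the fast path, while rare heavy tokens blow past it.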
D.4 Improving Efficiency of Other Devices
As mentioned in Section 4, given that our kernels consistently reduce memory requirements during training and, as a side benefit, reduce reliance on newer tensor core units, they have immediate potential relevance for less capable hardware. Thus, to empirically validate these considerations, we provide additional results comparing the training-time speedups of our kernels on NVIDIA's RTX PRO 6000 GPUs against the H100 PCIe GPUs used throughout our main paper and other experiments. Some of the crucial differences of this GPU come from the memory side, with a considerably reduced memory bandwidth (1.59 TB/s vs. 2.0 TB/s). In contrast, the RTX PRO 6000 benefits from a larger number of Streaming Multiprocessors than the H100 (188 vs. 114), potentially allowing for greater occupancy for sparse workloads.
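As a back-of-the-envelope check (a sketch using only the spec figures quoted above, not numbers from the paper's measurements), the two ratios driving the device comparison can be computed directly:

```python
# Spec figures quoted above for H100 PCIe vs. RTX PRO 6000.
h100_bw, rtx_bw = 2.0, 1.59        # memory bandwidth, TB/s
h100_sm, rtx_sm = 114, 188         # Streaming Multiprocessor counts

bw_deficit = 1 - rtx_bw / h100_bw   # bandwidth the RTX 6000 gives up
sm_surplus = rtx_sm / h100_sm       # extra SM parallelism on the RTX 6000

print(f"RTX 6000: {bw_deficit:.0%} less bandwidth, "
      f"{sm_surplus:.2f}x the SM count of the H100")
```

The roughly 20% bandwidth deficit is consistent with the 19% slowdown of the memory-bandwidth-bound kernels reported below, while the ~1.65× SM surplus is what the sparse operations can exploit.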
As shown in Figure 12, and in line with our considerations, we find significantly higher speedups on the RTX 6000 GPU across all L1 regularization levels considered. These speedup differences are even more pronounced at higher regularization levels, extending the practical range of L1 coefficients for which sparsity provides meaningful efficiency improvements. When dissecting what causes these greater speedups, we first find that, absent H100-specific features such as the higher tensor core throughput, the runtime of the dense GEMM operations roughly doubles from around 400 to 800 microseconds on the RTX 6000. Similarly, kernels that are memory-bandwidth bound, including the dense-to-hybrid matrix multiplication, are also slower by 19% on the RTX 6000 than on the H100. However, once in our hybrid sparse format, thanks to the larger Streaming Multiprocessor count of the RTX 6000 GPU, the sparse operations run faster than on the H100, with speedups of 1.34× and 2.1× for sparse-to-dense and transposition operations, respectively. These results indicate that leveraging sparsity with targeted kernels could significantly improve the performance of cheaper devices that do not implement the latest hardware innovations of higher-end units such as the H100, lowering the field's canonical hardware barriers.
Appendix E Further Related Work
E.1 Activation Sparsity in Transformers
Expanding on the findings of [zhang-etal-2022-moefication], [li2023lazyneuronphenomenonemergence] documents that Transformer MLP layers with ReLU activations exhibit inherent activation sparsity across architectures, depths, and data distributions. Building on this observation, [mirzadeh2023relu_apple_finetune] shows that replacing GELU with ReLU in non-gated feed-forward layers yields negligible performance degradation while enabling up to a three-fold theoretical inference speedup with less computation. However, they focus on older architectures (OPT models) with non-gated feed-forward blocks and leave efficient kernel implementation to future work.
More recent methods have also been proposed to enhance sparsity after altering modern gated architectures and have claimed speedups when running sparse feed-forward layers in isolation on older generations of devices. TurboSparse [song2024turbo] proposes a modification to the feed-forward block itself, introducing dReLU, which applies ReLU to both gate and up projections: h = ReLU(xW_g) ⊙ ReLU(xW_u). ProSparse [song2024prosparse] proposes finetuning pretrained models and artificially thresholding the activations to increase sparsity. Q-Sparse [q_sparse_top_k_related] further deviates from standard architectures by maintaining only the top-K activations and applying a straight-through estimator. We also note that additional works proposed introducing structured sparsity post-training, such as by predicting [dejavu_contectual_sparsity_related] and pruning activations to set sparsity levels [cats_post_training_thresh_related, teal_post_training_thresh_related]. Unlike these works, our paper introduces general-purpose kernels to leverage unstructured sparsity, demonstrating empirical efficiency benefits during LLM training and inference.
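The dReLU block defined by the formula above is straightforward to write down. The following is a minimal NumPy sketch of that formula (toy shapes and the `drelu_ffn` name are assumptions for illustration, not TurboSparse's implementation):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def drelu_ffn(x, Wg, Wu, Wd):
    """dReLU feed-forward block: h = ReLU(x Wg) * ReLU(x Wu), followed
    by a down projection. Applying ReLU to *both* the gate and the up
    projection makes h non-zero only where both activations are
    positive, which is what boosts sparsity over a standard gated block.
    """
    h = relu(x @ Wg) * relu(x @ Wu)
    return h @ Wd

# Toy shapes: d_model = 4, d_ff = 8 (illustrative only).
rng = np.random.default_rng(0)
x = rng.normal(size=(2, 4))
Wg, Wu, Wd = (rng.normal(size=s) for s in [(4, 8), (4, 8), (8, 4)])
h = relu(x @ Wg) * relu(x @ Wu)
print(f"fraction of zero hidden activations: {np.mean(h == 0):.2f}")
```

Because each ReLU independently zeros roughly half of a zero-centered pre-activation, the elementwise product is zero wherever either factor is, so the hidden state is sparser than with a single ReLU.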
E.2 Architectural Approaches to Sparsity
Mixture-of-Experts (MoE) architectures [DBLP:journals/corr/ShazeerMMDLHD17, lepikhin2020gshard, fedus2022switch] partition feed-forward layers into separately routed experts, decoupling model capacity from per-token computation. However, MoE requires predetermining the number of experts and sparsity level before training, limiting adaptability to input complexity.
Product key memory [lample2019largememorylayersproduct] maintains fixed sparsity patterns through O(log n) key retrieval. PEER [he2024mixturemillionexperts] extends this approach to over one million single-neuron experts with 99.99% architectural sparsity. UltraMem [huang2025ultrasparsememorynetwork] improves on PKM and scales to 20 million memory slots, showing that it can outperform MoE with the same parameter and computation budgets. Fast Feedforward Networks [belcak2023fastfeedforwardnetworks] use differentiable binary trees to achieve 99% sparsity.
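The product-key trick that makes this retrieval cheap can be illustrated in a few lines. This is a minimal sketch of the general idea, not the authors' implementation: the full key set is the Cartesian product of two small sub-key tables, so a top-k search over N keys reduces to a top-k search over each half plus scoring k×k candidate pairs.

```python
import numpy as np

def product_key_topk(q, K1, K2, k):
    """Minimal product-key retrieval sketch. Keys are implicitly the
    Cartesian product of sub-key tables K1, K2 (each ~sqrt(N) rows);
    score(i, j) = <q1, K1[i]> + <q2, K2[j]>. The true top-k pairs are
    guaranteed to lie in the product of the per-half top-k index sets,
    so only k*k candidates ever need to be scored.
    """
    d = q.shape[0] // 2
    q1, q2 = q[:d], q[d:]
    s1, s2 = K1 @ q1, K2 @ q2
    i1 = np.argsort(s1)[-k:]   # top-k sub-key indices, first half
    i2 = np.argsort(s2)[-k:]   # top-k sub-key indices, second half
    cand = [(s1[i] + s2[j], (i, j)) for i in i1 for j in i2]
    cand.sort(reverse=True)
    return [idx for _, idx in cand[:k]]
```

The guarantee holds because the pair score is additive: if index i were outside the top-k of the first half, at least k pairs using a better i' would outscore every pair containing i.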
While these architectural approaches achieve extreme sparsity, they require substantial modifications to standard Transformer training pipelines. Our approach instead works with conventional architectures, requiring only a change of activation function and optional regularization, making it readily applicable to existing models and training infrastructure.