LLM Deployment Techniques – ∇ ickma.dev
Source: https://ickma2311.github.io/ML/HW-SW-codesign/efficient-ai-lecture-13-llm-deployment-techniques.html
This lecture is about the systems side of deploying large language models.
The main idea is:
LLM deployment is limited less by arithmetic alone and more by memory movement, quantization error, KV-cache management, and request scheduling.
For training, the hard question is how to distribute a large model. For deployment, the hard question is how to serve many requests with low latency and high throughput while keeping model quality.
Quantization
Quantization lowers numeric bit width so inference can use less memory bandwidth, less storage, and cheaper arithmetic.
For LLMs, weights are usually easier to quantize than activations. Activations are harder because a few channels may contain very large outliers. A useful observation is that these outliers often persist in fixed channels.
That makes it possible to rebalance the quantization problem:
- scale down difficult activation channels
- scale up the corresponding weight channels
- preserve the same matrix multiplication result
Activation Smoothing
Suppose a linear layer computes:
\[ Y = XW. \]
If one activation channel has much larger magnitude than the others, activation quantization needs a large scale, and most values lose precision.
Activation smoothing introduces a positive per-channel scale vector \(s\):
\[ \hat{X}=X\operatorname{diag}(s)^{-1}, \qquad \hat{W}=\operatorname{diag}(s)W. \]
Then:
\[ \hat{X}\hat{W} = X\operatorname{diag}(s)^{-1}\operatorname{diag}(s)W = XW. \]
So the computation is unchanged, but the quantization difficulty is shifted from activations into weights.
One common smoothing choice is:
\[ s=\sqrt{\frac{\max |X|}{\max |W|}}, \]
where the maxima are computed per channel from calibration data.
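As a concrete illustration, here is a minimal NumPy sketch of this equivalence; the shapes, the injected outlier channel, and the variable names are illustrative rather than taken from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 512))        # activations: (tokens, in_features)
X[:, 3] *= 50.0                       # inject one outlier channel
W = rng.normal(size=(512, 256))       # weights: (in_features, out_features)

# per-channel maxima, as measured on "calibration" data
s = np.sqrt(np.abs(X).max(axis=0) / np.abs(W).max(axis=1))

X_hat = X / s                         # X diag(s)^-1
W_hat = W * s[:, None]                # diag(s) W

# the matrix product is unchanged, but X_hat has a smaller dynamic range
print(np.allclose(X @ W, X_hat @ W_hat))             # True
print(np.abs(X).max() / np.abs(X).mean(),             # before smoothing
      np.abs(X_hat).max() / np.abs(X_hat).mean())     # after smoothing
```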
Offline Calibration
During calibration, we run representative data through the model and measure per-channel activation ranges.
The goal is not to train new parameters. The goal is to estimate which activation channels are difficult to quantize and choose smoothing scales that balance activation and weight ranges.
Offline Smoothing
During smoothing, the scale is folded into static model parameters.
For example, the previous normalization layer can be adjusted so it emits smoothed activations, while the current linear layer’s weight channels are scaled in the opposite direction.
The important deployment property is that this transformation can be folded offline. At runtime, the model does not need an extra explicit scaling operation.
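A minimal sketch of the fold, assuming the common LayerNorm-then-Linear pattern and PyTorch's default parameterization; the smoothing scales are treated as already computed from calibration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
ln = nn.LayerNorm(512)
fc = nn.Linear(512, 256)
x = torch.randn(8, 512)
s = torch.rand(512) + 0.5              # smoothing scales from calibration (assumed given)

y_ref = fc(ln(x))

with torch.no_grad():
    # fold diag(s)^-1 into the LayerNorm affine parameters ...
    ln.weight.div_(s)
    ln.bias.div_(s)
    # ... and diag(s) into the Linear layer's input channels (weight shape: out x in)
    fc.weight.mul_(s)

y_smoothed = fc(ln(x))
print(torch.allclose(y_ref, y_smoothed, atol=1e-5))   # True: no extra runtime scaling op
```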
Inference
After smoothing, activations are less dominated by outliers and become easier to quantize. The model can then use efficient low-bit or INT8 operations with less quality loss.
This is the core idea behind SmoothQuant-style deployment:
- activations are hard to quantize
- weights can absorb some scaling
- an equivalent transformation makes runtime quantization easier
Weight-Only Quantization
LLM decoding is often memory-bound. During autoregressive decoding, the system repeatedly streams model weights from memory while generating one token at a time.
That means even W8A8 may not be enough. Weight bandwidth can become the bottleneck, so compressing weights to INT4 can be very valuable.

Naive round-to-nearest quantization at very low bit width can fail badly. For example, applying simple RTN to large OPT models can cause a large perplexity spike. Extreme weight compression needs a more careful method.
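For concreteness, a minimal NumPy sketch of plain group-wise RTN INT4 quantization, the baseline that breaks down on large models; the group size and tensor shape are illustrative.

```python
import numpy as np

def rtn_int4(w, group_size=128):
    """Round-to-nearest INT4 quantization with one scale per group of weights."""
    w = w.reshape(-1, group_size)
    scale = np.abs(w).max(axis=1, keepdims=True) / 7.0   # symmetric INT4: codes in [-8, 7]
    q = np.clip(np.round(w / scale), -8, 7)
    return (q * scale).reshape(-1)                        # dequantized weights

rng = np.random.default_rng(0)
w = rng.normal(size=1024 * 1024).astype(np.float32)
w_q = rtn_int4(w)
print("mean abs quantization error:", np.abs(w - w_q).mean())
```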
AWQ
AWQ stands for Activation-aware Weight Quantization.
Its key observation is:
Not all weight channels matter equally, and importance is better identified through activations than by weight magnitude alone.
Keeping or protecting a very small fraction of salient channels can greatly improve quality under INT4 weight-only quantization.

Selecting Salient Channels
AWQ selects salient channels using activation statistics. A weight channel that connects to large activations can contribute much more to output error if quantized poorly.

This is why activation-aware importance is more useful than weight-only magnitude.
Scaling Instead of Mixed Precision
One direct idea is to keep the top 1% salient weights in higher precision. But mixed precision inside the same tensor is awkward for hardware.
AWQ instead uses scaling.
For salient channels, scale weights up before quantization and scale the corresponding activations down:
\[ WX \approx Q(W \cdot s)(s^{-1}X). \]
The scaling preserves the mathematical computation while reducing the relative quantization error of important weight channels.
Why does scaling help?
For standard quantization:
\[ \operatorname{Err}(Q(w)\cdot x) = \Delta \cdot \operatorname{Err}\left(\operatorname{Round}\left(\frac{w}{\Delta}\right)\right) \cdot x. \]
For scaled weights:
\[ \operatorname{Err}\left(Q(w\cdot s)\cdot \frac{x}{s}\right) = \Delta' \cdot \operatorname{Err}\left(\operatorname{Round}\left(\frac{w\cdot s}{\Delta'}\right)\right) \cdot \frac{x}{s}. \]
If \(s\) is moderate and only applied to a small subset of channels, the global quantization step may stay roughly unchanged, so \(\Delta' \approx \Delta\). The error contribution for those channels is then reduced roughly by a factor of \(s\).
Optimal Scaling Search
AWQ searches for a scale that minimizes reconstruction error:
\[ \mathcal{L}(s) = \left\|Q(W\cdot s)(s^{-1}X)-WX\right\|. \]
The scale is parameterized by activation statistics:
\[ s=s_X^\alpha, \qquad \alpha^*=\arg\min_\alpha \mathcal{L}(s_X^\alpha). \]
The search is small and calibration-efficient, because it is not trying to regress every weight. It is choosing a channel scaling rule that preserves layer output.
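A minimal sketch of this search, written with the \(Y = XW\) convention used earlier in these notes; the grid over \(\alpha\), the per-output-channel INT4 quantizer, and the synthetic salient channels are all illustrative assumptions.

```python
import numpy as np

def quantize_int4(w):
    """Per-output-channel symmetric INT4 round-to-nearest (illustrative)."""
    scale = np.abs(w).max(axis=0, keepdims=True) / 7.0
    return np.clip(np.round(w / scale), -8, 7) * scale

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 512))
X[:, :8] *= 20.0                          # a few salient input channels
W = rng.normal(size=(512, 256))           # layer computes X @ W
Y_ref = X @ W

s_x = np.abs(X).mean(axis=0)              # per-channel activation statistic from calibration

best = None
for alpha in np.linspace(0.0, 1.0, 21):   # small grid search over alpha
    s = s_x ** alpha
    W_q = quantize_int4(W * s[:, None])   # quantize diag(s) W
    err = np.linalg.norm((X / s) @ W_q - Y_ref)
    if best is None or err < best[1]:
        best = (alpha, err)

print("best alpha:", best[0], "reconstruction error:", best[1])
```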
Advantages
AWQ has several practical advantages:
- it is simple to implement
- it is friendly to hardware kernels
- it needs relatively little calibration data
- it is more robust to calibration-distribution mismatch than heavier regression-based methods
- it works across instruction-tuned and multimodal models
INT4 Inference Kernels
Weight-only INT4 quantization reduces memory traffic, but hardware usually reads bytes, not individual 4-bit values.
That makes packing layout important.
AWQ-style kernels can pre-shuffle INT4 weights offline so unpacking is efficient at runtime. The layout allows SIMD instructions to decode many weights in parallel:
- mask to extract low 4-bit values
- shift the byte block right by 4 bits
- mask again to extract the other 4-bit values
The lesson is that compression is not only an algorithmic issue. The bit layout must match the hardware’s vector operations.
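A minimal NumPy sketch of the byte-level idea: two 4-bit codes per byte, recovered with a mask and a shift. Real AWQ kernels additionally pre-shuffle the layout to suit the GPU's vector instructions; this only shows the mask-and-shift step.

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.integers(0, 16, size=1024, dtype=np.uint8)      # unsigned 4-bit codes, two per byte

# pack: even-indexed codes in the low nibble, odd-indexed in the high nibble
packed = (q[0::2] | (q[1::2] << 4)).astype(np.uint8)    # 512 bytes instead of 1024

# unpack with mask and shift, as the kernel would do per byte block
low = packed & 0x0F
high = (packed >> 4) & 0x0F
unpacked = np.empty_like(q)
unpacked[0::2] = low
unpacked[1::2] = high
print(np.array_equal(q, unpacked))                       # True
```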
Kernel Fusion
Efficient LLM kernels avoid unnecessary trips to DRAM.
For attention, a fused kernel can combine operations such as batched matrix multiplication, masking, softmax, and the second matrix multiplication while keeping intermediates in fast memory.
For quantized GEMM, dequantization can be fused into matrix multiplication:
- CUDA cores unpack and dequantize INT4 weights
- tensor cores consume the dequantized values for matrix multiply
- the full decompressed model is never written back to memory
This is why low-bit inference depends on both quantization method and kernel design.
Pruning & Sparsity
Weight Sparsity
For LLM pruning, activation-aware criteria can outperform simple weight magnitude.
A useful pruning score is:
\[ |W| \cdot \|X\|. \]
This keeps weights that are important under the activations the model actually sees.
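A minimal NumPy sketch of this score, in the spirit of Wanda-style pruning, assuming per-output-row pruning and illustrative shapes.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(256, 512))            # calibration activations (tokens, in_features)
W = rng.normal(size=(1024, 512))           # weights (out_features, in_features)
sparsity = 0.5

score = np.abs(W) * np.linalg.norm(X, axis=0)   # |W_ij| * ||X_j||, broadcast over output rows

# prune the lowest-scoring weights within each output row
k = int(W.shape[1] * sparsity)
threshold = np.partition(score, k, axis=1)[:, k:k + 1]
mask = score >= threshold
W_pruned = W * mask
print("kept fraction:", mask.mean())
```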

Contextual Sparsity
Static sparsity permanently removes weights. That can hurt accuracy because the same sparse subnetwork must serve every input.
Contextual sparsity is input-dependent. Different inputs activate different attention heads, MLP features, or neurons. Systems such as DejaVu try to predict which computations are unnecessary for the current context and skip them dynamically.
The key distinction:
- static sparsity asks which weights are globally unnecessary
- contextual sparsity asks which computations are unnecessary for this input
Mixture of Experts
Mixture-of-Experts models use sparse activation at the module level.
Instead of sending every token through the same dense feed-forward network, a router chooses a small number of experts for each token.
This decouples total parameter count from per-token inference cost:
- more experts increase total model capacity
- only selected experts run for each token
- inference cost can stay closer to a smaller dense model
Capacity Factor
Routing is not always balanced. Some experts may receive more tokens than others.
The capacity factor controls how many tokens each expert is allowed to process:
\[ \text{expert capacity} = \frac{\text{tokens per batch}}{\text{number of experts}} \times \text{capacity factor}. \]
A low capacity factor is efficient but risks dropping tokens when routing is imbalanced. A higher capacity factor gives slack capacity but increases memory and compute overhead.
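A minimal sketch of top-1 routing with an expert capacity limit; the router weights, the deliberate imbalance, and the drop-on-overflow policy are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
tokens, d_model, n_experts, capacity_factor = 512, 64, 8, 1.25

x = rng.normal(size=(tokens, d_model))
router = rng.normal(size=(d_model, n_experts))

logits = x @ router
logits[:, 0] += 2.0                                    # deliberately imbalance the routing
choice = logits.argmax(axis=1)                         # top-1 (Switch-style) expert per token

capacity = int(tokens / n_experts * capacity_factor)   # expert capacity formula from above
load = np.zeros(n_experts, dtype=int)
dropped = 0
for e in choice:
    if load[e] < capacity:
        load[e] += 1
    else:
        dropped += 1                                   # over-capacity tokens are dropped

print("capacity per expert:", capacity)
print("per-expert load:", load, "| dropped tokens:", dropped)
```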
Attention Sparsity
Attention sparsity reduces work by avoiding unimportant tokens, heads, or value fetches.
Examples:
- prune tokens with low cumulative attention score
- remove heads that contribute little
- skip fetching a value vector when the query-key score is too small
- start with lower precision and use higher precision only when confidence is low
The goal is to avoid spending full attention cost on information that will not affect the output much.
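As one concrete instance, a minimal sketch of ranking cached tokens by cumulative attention and keeping only the most-attended ones; the shapes and keep budget are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n_queries, n_keys, d = 32, 256, 64
Q = rng.normal(size=(n_queries, d))
K = rng.normal(size=(n_keys, d))

scores = Q @ K.T / np.sqrt(d)
attn = np.exp(scores - scores.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

# cumulative attention mass each cached key token has received across queries
importance = attn.sum(axis=0)

keep = np.argsort(importance)[-64:]        # keep only the 64 most-attended tokens
print("kept", keep.size, "of", n_keys, "cached tokens")
```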
LLM Serving Systems
LLM Serving Metrics
Deployment quality is measured through both user-facing latency and system-wide throughput.
Time to First Token
Time to first token, or TTFT, is the delay between submitting a request and receiving the first generated token.
It is especially important for interactivity. High TTFT makes the model feel unresponsive.
Time Per Output Token
Time per output token, or TPOT, is the average time to generate each subsequent token after the first one.
This controls the apparent streaming speed of the response.
Latency
End-to-end latency is:
\[ \text{latency} = \text{TTFT} + \text{TPOT} \times \text{number of generated tokens}. \]
Throughput
Throughput is the total number of tokens generated per second across all users and requests.
Users tend to care about latency. Service providers also care about throughput, because it determines how much traffic the same hardware can serve.
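A quick worked example with illustrative numbers; the throughput line is only a rough estimate that assumes TPOT stays constant at that batch size.

```python
# Illustrative numbers for one request that generates 200 tokens.
ttft = 0.5                                  # seconds until the first token (prefill)
tpot = 0.04                                 # seconds per output token (decode)
n_tokens = 200

latency = ttft + tpot * n_tokens            # formula from the Latency section: 8.5 s
batch = 32                                  # concurrent requests decoded together
throughput = batch / tpot                   # rough tokens/s across all users

print(f"latency: {latency:.1f} s, throughput: {throughput:.0f} tokens/s")
```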
PagedAttention
The KV cache is a major memory problem in LLM serving.
Waste comes from two sources:
- output length is unknown, so systems may over-allocate cache memory
- not all allocated cache memory is actively used at every stage
PagedAttention borrows the idea of paging from operating systems.

Instead of requiring one contiguous KV-cache block per sequence, PagedAttention breaks the cache into fixed-size blocks.
It then uses a block table to map logical token blocks to physical memory blocks.
How It Works
- Split each sequence’s KV cache into logical blocks.
- Allocate physical GPU memory blocks on demand.
- Store logical blocks in non-contiguous physical memory.
- Use a block table to translate logical block IDs into physical block IDs.
This avoids large contiguous allocations and reduces fragmentation.
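A minimal sketch of the bookkeeping side (not the attention kernel): physical blocks come from a free pool on demand, and a per-sequence block table maps logical blocks to physical ones. The block size, pool size, and class name are illustrative.

```python
BLOCK_SIZE = 16                                        # tokens per KV-cache block

class BlockManager:
    def __init__(self, num_physical_blocks):
        self.free = list(range(num_physical_blocks))   # pool of physical block IDs
        self.block_tables = {}                         # seq_id -> list of physical block IDs

    def append_token(self, seq_id, num_tokens_so_far):
        """Allocate a new physical block only when the sequence crosses a block boundary."""
        table = self.block_tables.setdefault(seq_id, [])
        if num_tokens_so_far % BLOCK_SIZE == 0:        # current blocks are full
            table.append(self.free.pop())
        logical_block = num_tokens_so_far // BLOCK_SIZE
        return table[logical_block], num_tokens_so_far % BLOCK_SIZE   # (physical block, offset)

mgr = BlockManager(num_physical_blocks=1024)
for t in range(40):                                    # 40 tokens -> 3 blocks, allocated on demand
    block, offset = mgr.append_token("request-0", t)
print(mgr.block_tables["request-0"])                   # physical blocks backing logical blocks 0..2
```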
Benefits
PagedAttention improves serving efficiency because:
- memory is allocated closer to actual need
- more concurrent requests fit in GPU memory
- throughput increases through larger effective batching
- shared prompt prefixes can reuse physical KV-cache blocks

This memory-sharing feature is useful for beam search, parallel sampling, and workloads where several continuations share the same prompt.
FlashAttention
Standard attention is memory-bound.
The core issue is the memory hierarchy:
- GPU SRAM is small but very fast
- GPU HBM is large but much slower
- materializing the full \(N \times N\) attention matrix causes heavy HBM traffic
FlashAttention does not change the attention math. It changes the IO pattern.

Tiling
FlashAttention splits the computation into tiles that fit in SRAM.
It loads blocks of \(Q\), \(K\), and \(V\), computes attention locally, and avoids writing the full attention matrix to HBM.
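A minimal NumPy sketch of the tiling idea: K and V are processed in blocks while a running max and normalizer keep the softmax exact, so the full \(N \times N\) score matrix is never materialized. This mirrors the online-softmax trick, but it is a numerical illustration, not an IO-aware kernel.

```python
import numpy as np

def tiled_attention(Q, K, V, block=64):
    """Blockwise attention with a running softmax; never forms the full N x N matrix."""
    n, d = Q.shape
    out = np.zeros_like(Q)
    running_max = np.full(n, -np.inf)
    running_sum = np.zeros(n)
    for start in range(0, K.shape[0], block):
        Kb, Vb = K[start:start + block], V[start:start + block]
        scores = Q @ Kb.T / np.sqrt(d)                    # one (n, block) tile of scores
        new_max = np.maximum(running_max, scores.max(axis=1))
        correction = np.exp(running_max - new_max)        # rescale previous partial results
        p = np.exp(scores - new_max[:, None])
        running_sum = running_sum * correction + p.sum(axis=1)
        out = out * correction[:, None] + p @ Vb
        running_max = new_max
    return out / running_sum[:, None]

rng = np.random.default_rng(0)
Q = rng.normal(size=(128, 64)); K = rng.normal(size=(512, 64)); V = rng.normal(size=(512, 64))

# reference: standard attention that materializes the full score matrix
S = Q @ K.T / np.sqrt(64)
ref = np.exp(S - S.max(axis=1, keepdims=True)); ref /= ref.sum(axis=1, keepdims=True)
print(np.allclose(tiled_attention(Q, K, V), ref @ V))     # True
```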
Kernel Fusion
FlashAttention fuses operations such as score computation, masking, softmax, and value aggregation into a single IO-aware kernel.
This reduces memory traffic and makes long-context attention much more practical.
The key idea is:
Attention efficiency is not only FLOPs. It is also how many times data moves through the memory hierarchy.
Speculative Decoding
Autoregressive decoding generates tokens one by one. For small batch sizes, the process is often memory-bound because each token step streams large model weights.
Speculative decoding uses two models:
- a small draft model
- a large target model
The draft model quickly proposes several tokens. The target model verifies those draft tokens in parallel.
If the draft tokens agree with the target model’s distribution, multiple tokens can be accepted at once. If not, the target model corrects the sequence.
The important property is that speculative decoding can preserve the target model’s output distribution while improving speed, often by 2x to 3x when the draft model is accurate enough.
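A minimal sketch of the propose-then-verify loop, using a simplified greedy acceptance rule and two stand-in functions in place of real models; the full method uses a rejection-sampling test so that the target model's output distribution is preserved exactly.

```python
VOCAB = 50

def target_next(prefix):
    """Stand-in 'large' target model (greedy): the next token it would emit after prefix."""
    return (prefix[-1] * 7 + 3) % VOCAB

def draft_next(prefix, step):
    """Stand-in 'small' draft model: mimics the target but drifts after two tokens."""
    guess = (prefix[-1] * 7 + 3) % VOCAB
    return guess if step < 2 else (guess + 1) % VOCAB

def speculative_step(prefix, k=4):
    # 1) the cheap draft model proposes k tokens autoregressively
    draft, ctx = [], list(prefix)
    for step in range(k):
        token = draft_next(ctx, step)
        draft.append(token)
        ctx.append(token)
    # 2) the target model checks the draft (done in parallel in a real system)
    accepted = []
    for token in draft:
        expected = target_next(prefix + accepted)
        if token == expected:
            accepted.append(token)            # draft agrees: token accepted for free
        else:
            accepted.append(expected)         # first mismatch: take the target's token and stop
            break
    return accepted                           # 1..k tokens from a single verification pass

print(speculative_step([3, 1, 4]))            # three tokens accepted here: [31, 20, 43]
```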
Batching
Batching increases GPU utilization by serving multiple requests together.
- No batching: process one request at a time. Simple but inefficient.
- Static batching: wait for a fixed batch size. Good for offline work but increases waiting time.
- Dynamic batching: batch when full or after a timeout. Balances throughput and latency.
- Continuous batching: schedule requests token by token. Best suited for LLM serving because new requests can join as others finish.

Continuous batching, also called in-flight batching, is especially important because LLM requests have different prompt lengths and generation lengths. Token-level scheduling avoids idle slots left by completed requests.
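A minimal sketch of the token-level scheduling idea: finished sequences free their batch slot immediately and waiting requests join before the next decode step. Request lengths and the batch limit are illustrative.

```python
from collections import deque

waiting = deque([("req-0", 5), ("req-1", 3), ("req-2", 8), ("req-3", 2)])  # (id, tokens to generate)
running = {}                       # id -> tokens still to generate
MAX_BATCH = 2
step = 0

while waiting or running:
    # admit new requests into any free batch slots before the next token step
    while waiting and len(running) < MAX_BATCH:
        req_id, length = waiting.popleft()
        running[req_id] = length
    # one decode step: every running request generates one token
    for req_id in list(running):
        running[req_id] -= 1
        if running[req_id] == 0:
            del running[req_id]    # finished requests free their slot immediately
    step += 1

print("total decode steps:", step)   # 11 here, vs 13 with static batches padded to the longest request
```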
Summary
LLM deployment is a stack of systems optimizations:
- SmoothQuant shifts activation difficulty into weights
- AWQ protects salient weight channels through activation-aware scaling
- INT4 kernels need hardware-friendly packing and fused dequantization
- sparsity saves work only when the runtime can exploit it
- MoE increases parameter count without activating every parameter per token
- PagedAttention makes KV-cache memory management more efficient
- FlashAttention reduces HBM traffic through tiling and fusion
- speculative decoding and batching improve serving throughput and latency
The common theme is that efficient LLM serving depends on matching model transformations to the memory hierarchy and runtime scheduler.