@AndrewYNg: New course on serving LLMs efficiently -- how do you serve models to many concurrent users at low latency and reasonabl…
Summary
Andrew Ng and DeepLearning.AI have launched a new short course on efficient LLM inference with vLLM, built in partnership with Red Hat, covering quantization, PagedAttention, continuous batching, and benchmarking for serving LLMs at scale.
View Cached Full Text
Cached at: 06/05/26, 02:19 AM
New course on serving LLMs efficiently – how do you serve models to many concurrent users at low latency and reasonable cost? This short course is built with @RedHat and taught by @cedricclyburn.
Efficient LLM serving requires efficient memory management. A 70B-parameter model takes ~140 GB just to load the weights. On top of that, every active request needs its own chunk of GPU memory, the KV cache, to store the token context it has built up so far. In this course, you’ll learn to reduce a model’s memory footprint with quantization and serve it using vLLM, which handles many concurrent requests efficiently through smart memory management.
Skills you’ll gain:
- Quantize a model and measure the accuracy tradeoff
- Serve a model with vLLM and watch it handle concurrent requests efficiently
- Benchmark your deployment and make informed tradeoffs between speed, cost, and accuracy
Join and learn to serve LLMs efficiently: https://deeplearning.ai/courses/fast-and-efficient-llm-inference-with-vllm…
Fast & Efficient LLM Inference with vLLM
Source: https://www.deeplearning.ai/courses/fast-and-efficient-llm-inference-with-vllm
What you’ll learn
- Apply quantization to shrink a model’s memory footprint, then measure the accuracy tradeoff.
- Serve a model with vLLM and see how efficiently it handles many concurrent requests using techniques like continuous batching and PagedAttention.
- Benchmark your deployment and measure model quality so you can make informed tradeoffs between speed, cost, and accuracy.
About this course
Introducing Fast & Efficient LLM Inference with vLLM, a short course built in partnership with Red Hat and taught by Cedric Clyburn, Senior Developer Advocate at Red Hat.
Serving open-source LLMs efficiently, for many users at low latency and reasonable cost, comes down mostly to memory management. Two things compete for that memory: the model weights and the KV cache. A 70-billion-parameter model takes around 140 GB of memory just for the weights, while the KV cache grows with every request you serve. In this course, you’ll learn to shrink the weights through quantization, and serve the model with vLLM, the widely adopted open-source serving system, taking advantage of the memory management techniques it provides like PagedAttention and prefix caching.
You’ll run the full optimize-deploy-benchmark workflow on a real model: compressing an open-source Qwen model with LLM Compressor, serving it with vLLM, and benchmarking your deployment under realistic traffic using GuideLLM and lm-eval.
In detail, you’ll:
- Understand why efficient LLM deployment matters, what happens during inference, what the KV cache is, and how the GPU memory hierarchy shapes performance.
- Explore LLM optimization fundamentals and how compression techniques like weight and activation quantization enhance a model’s throughput and latency while preserving accuracy.
- Use LLM Compressor to quantize a full-precision model, compare its size before and after, and use perplexity to measure whether the compressed model still performs well.
- Learn the three core techniques behind modern LLM serving: continuous batching to keep the GPU busy, PagedAttention to manage the KV cache without waste, and prefix caching to skip recomputation when requests share content.
- Connect to a vLLM inference server, send requests through the OpenAI-compatible API, and watch vLLM’s memory management techniques working live in the metrics.
- Benchmark your deployment under load with GuideLLM and evaluate model quality with lm-eval.
By the end, you’ll have run the full optimize-deploy-benchmark workflow on a real model and built the intuition to navigate the tradeoffs between accuracy, speed, and cost.
Who should join?
ML engineers, platform engineers, and developers who need to deploy open-source LLMs efficiently. Familiarity with Python and basic LLM concepts (tokens, inference, GPU usage) is recommended.
Course Outline
9Lessons・3Code Examples
Instructor

Cedric Clyburn
Senior Developer Advocate at Red Hat
Similar Articles
@AndrewYNg: New course: Transformers in Practice. You'll get a practical view of how transformer-based LLMs work, so you can reason…
New course 'Transformers in Practice' from deeplearning.ai and AMD teaches practical understanding of transformer-based LLMs, covering text generation, attention mechanisms, and inference optimization techniques like quantization and KV caching.
@ickma2311: Efficient AI Lecture 13: LLM Deployment Techniques The lecture helped me understand AWQ, vLLM, and FlashAttention very …
A lecture on LLM deployment techniques covering AWQ, vLLM, FlashAttention, quantization, and activation smoothing for efficient serving.
LLMs Go To Confession, Automated Scientific Research, What Copilot Users Want, Reasoning For Less
DeepLearning.AI launches 'Build with Andrew,' a course enabling non-coders to build web applications using AI in under 30 minutes, while research addresses LLM transparency issues including model honesty and automated scientific research capabilities.
@polydao: This Stanford lecture on AI inference will teach you more about how LLMs work in production than most ML courses > Clau…
A Stanford lecture on AI inference emphasizes practical bottlenecks like KV-cache and techniques like speculative decoding and continuous batching, offering more real-world insight than typical ML courses.
@AYi_AInotes: Fellow developers working on LLM production deployment, check out Andrew Ng's new course. The free version gives you access to all videos and base code. This course is not another rerun of the 'Attention is All You Need' math derivation, nor another set of mystical prompt-tuning tricks, nor yet another toy...
Andrew Ng has launched a new course on LLM production deployment. The free version provides access to all videos and base code. The course dives deep into LLM internals, inference optimization (such as quantization, KV Cache, Flash Attention, speculative decoding), and hardware-aware optimization. Taught by AMD's VP of Engineering, it aims to help developers transform Transformer from an academic concept into a debuggable, optimizable engineering tool.