An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]
Summary
An open, in-progress handbook explaining LLM inference internals including GPU memory hierarchy, KV cache, batching, and popular inference engines like vLLM and TensorRT-LLM.
Similar Articles
Inference Engines for LLMs & Local AI Hardware (2026 Edition)
This article provides a comprehensive guide to LLM inference engines for local AI hardware in 2026, explaining how to choose based on hardware strategy, workload, and serving model, and covering engines like llama.cpp, MLX, ExLlamaV2/3, vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo.
LLMs 101: A Practical Guide (2026 Edition)
A comprehensive practical guide to LLMs covering inference mechanics, tokens, Transformers, KV cache, local deployment hardware, and quantization as of May 2026.
Local LLM Inference Optimization: The Complete Guide
A comprehensive guide to optimizing local LLM inference on consumer hardware, covering tools like llama.cpp, vLLM, and LM Studio, with practical advice on memory hierarchy, layer placement, and common failure modes.
Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA
Tiny-vLLM is a high-performance LLM inference engine implemented in C++ and CUDA, offering features like continuous batching and PagedAttention, and serves as an educational resource.
Personal continual learning for LLMs without GPU — position paper [OC]
The author proposes two architectures, Internal KV-Sphere Architecture (IKSA) and Background Micro Fine-Tuning (BMFT), for enabling LLMs to learn continually from personal interactions without GPU requirements and without catastrophic forgetting.