An open handbook on LLM inference at scale (GPU internals, KV cache, batching, vLLM/SGLang/TensorRT-LLM) [P]

Reddit r/MachineLearning 06/20/26, 12:27 PM Tools

llm-inference gpu-internals kv-cache batching open-source handbook developer-tools

Summary

An open, in-progress handbook explaining LLM inference internals, including the GPU memory hierarchy, the KV cache, batching, and popular inference engines such as vLLM, SGLang, and TensorRT-LLM.

I've been working through the internals of LLM inference and writing up what I learn as an open, in-progress handbook. I just wrapped another chapter on GPU execution and memory internals: why a GPU sits mostly idle during inference, how the memory hierarchy gates throughput, and where the real bottlenecks live. I added Mermaid diagrams for the architecture pieces so the flow is easier to follow than a wall of text.

It's a personal learning project, still growing chapter by chapter. I'd value feedback or corrections from anyone who's run inference in production; where my mental model breaks down is exactly what I want to find. Issues and PRs welcome: github.com/harshuljain13/llm-inference-at-scale
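For a rough sense of the "GPU sits mostly idle" point, here is a back-of-envelope, roofline-style sketch. It is not taken from the handbook. The function name `decode_step_bounds` is hypothetical, and the hardware figures are round placeholder numbers, not the specs of any real GPU. It also assumes a dense model and counts only weight traffic, ignoring KV cache reads, activations, and kernel launch overhead.

```python
def decode_step_bounds(n_params, bytes_per_param, batch_size,
                       peak_flops, mem_bandwidth):
    """Lower bounds on the time for one decode step (one new token per sequence).

    Simplifying assumptions (illustrative only):
      * every weight is read from GPU memory once per step,
      * each weight contributes one multiply-add (2 FLOPs) per sequence,
      * KV cache and activation traffic are ignored.
    """
    weight_bytes = n_params * bytes_per_param
    flops = 2 * n_params * batch_size

    t_memory = weight_bytes / mem_bandwidth   # seconds spent streaming weights
    t_compute = flops / peak_flops            # seconds spent on arithmetic
    step_time = max(t_memory, t_compute)      # the slower side sets the pace

    return {
        "t_memory_s": t_memory,
        "t_compute_s": t_compute,
        "compute_utilization": t_compute / step_time,
        "memory_bound": t_memory > t_compute,
    }


# Placeholder numbers: a 7e9-parameter model in 16-bit weights on a
# hypothetical accelerator with 1e15 FLOP/s and 2e12 bytes/s of bandwidth.
single = decode_step_bounds(n_params=7e9, bytes_per_param=2, batch_size=1,
                            peak_flops=1e15, mem_bandwidth=2e12)
batched = decode_step_bounds(n_params=7e9, bytes_per_param=2, batch_size=1000,
                             peak_flops=1e15, mem_bandwidth=2e12)

print(single)
print(batched)
```

Under these assumptions a single-sequence decode step is dominated by reading the weights, so the arithmetic units are busy for only a small fraction of the step. Raising the batch size reuses each weight read across more sequences, which is one reason batching matters so much for throughput.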

Similar Articles

Inference Engines for LLMs & Local AI Hardware (2026 Edition)

X AI KOLs

This article provides a comprehensive guide to LLM inference engines for local AI hardware in 2026, explaining how to choose based on hardware strategy, workload, and serving model, and covering engines like llama.cpp, MLX, ExLlamaV2/3, vLLM, SGLang, TensorRT-LLM, and NVIDIA Dynamo.

LLMs 101: A Practical Guide (2026 Edition)

Local LLM Inference Optimization: The Complete Guide

Show HN: Tiny-vLLM – high performance LLM inference engine in C++ and CUDA

Personal continual learning for LLMs without GPU — position paper [OC]
