Modal explains how it cuts AI inference cold-start times by 40x using four techniques: cloud buffers, a custom filesystem, checkpoint/restore, and CUDA checkpoint/restore. It frames cloud buffer management as a linear optimization problem solved with GLOP.
Modal explains the four key ingredients it developed to spin up serverless GPU inference replicas in seconds instead of minutes, enabling efficient GPU allocation for variable AI workloads.
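
To make the "buffer management as a linear optimization problem" framing concrete, here is a minimal sketch of a toy buffer-sizing linear program. The summary does not give Modal's actual variables, costs, or constraints, so everything here is an assumption for illustration: the `Pool` type, the pool names, the prices, and the single "cover the expected demand" constraint are all hypothetical. Because this toy has only one covering constraint plus per-pool capacity bounds, filling the cheapest pools first is provably optimal, so the sketch needs no solver. A general-purpose LP solver such as GLOP becomes necessary once more constraints interact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    """One hypothetical source of buffer GPUs (names and numbers are made up)."""
    name: str
    cost_per_gpu: float  # assumed hourly price of holding one idle buffer GPU
    capacity: float      # assumed maximum number of GPUs this pool can supply


def plan_buffer(pools, demand):
    """Solve the toy LP:

        minimize    sum(cost_i * x_i)
        subject to  sum(x_i) >= demand
                    0 <= x_i <= capacity_i

    With a single covering constraint, cheapest-first filling is optimal.
    """
    if demand > sum(p.capacity for p in pools):
        raise ValueError("infeasible: demand exceeds total capacity")

    plan = {p.name: 0.0 for p in pools}
    remaining = float(demand)
    # Take GPUs from the cheapest pool first until the demand is covered.
    for pool in sorted(pools, key=lambda p: p.cost_per_gpu):
        if remaining <= 0:
            break
        take = min(pool.capacity, remaining)
        plan[pool.name] = take
        remaining -= take
    return plan


def total_cost(pools, plan):
    """Objective value of a plan: the hourly cost of the held buffer."""
    return sum(p.cost_per_gpu * plan[p.name] for p in pools)


# Hypothetical example: three pools, a buffer of 10 GPUs to hold.
pools = [
    Pool("cloud-a", cost_per_gpu=2.0, capacity=4.0),
    Pool("cloud-b", cost_per_gpu=3.0, capacity=10.0),
    Pool("cloud-c", cost_per_gpu=5.0, capacity=10.0),
]
plan = plan_buffer(pools, demand=10)
print(plan, total_cost(pools, plan))
```

The solution may be fractional, which is what distinguishes a linear program from an integer program. A real allocator would have to round the result or accept fractional targets as a planning signal.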