Modal engineers detail their approach to truly serverless GPUs for AI inference, combining cloud buffers, a custom content-addressed filesystem, and CPU/GPU checkpoint/restore to scale replicas in tens of seconds rather than minutes.