Tag
A researcher debuted Shard, achieving 30 tok/s inference on a 744B parameter model distributed across 6 consumer GPUs over the open internet, a 15-20x improvement over previous methods.
A tweet urging AI researchers to learn inference-acceleration basics and spotlighting CUDA Graph as the key to vLLM’s GPU utilization.