Tags
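A researcher has released Shard, which achieves 30 tok/s inference on a 744B-parameter model distributed across 6 consumer-grade GPUs over the open internet, a 15-20x improvement over previous approaches.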
A tweet arguing that every AI researcher should understand inference acceleration, and highlighting CUDA Graph as a core component of the vLLM server for GPU efficiency.