This paper introduces TwELL and Hybrid sparse formats with custom CUDA kernels to efficiently leverage unstructured sparsity in LLMs, achieving over 20% faster training and inference on H100 GPUs while reducing energy and memory usage.
Skymizer announces the HTX301, a PCIe inference card capable of running 700B-parameter LLMs on-premises, offering high memory capacity with low power consumption.