This paper presents HPC-LLM, a retrieval-augmented, domain-adapted assistant for HPC workflows, built by fine-tuning Llama 3.1 8B with QLoRA on HPC documentation. It reports performance comparable to larger general-purpose models at significantly lower resource requirements.
Jane Street allowed Dwarkesh Patel to tour their new Texas data center, which houses 4,032 GPUs with each rack pulling 140 kilowatts. The tour highlights the facility's massive scale and unique networking choices.
A user showcases a DIY cluster of M5 Max MacBooks connected via Thunderbolt 5, highlighting the aggregate compute power and connectivity challenges.
The article derives the column elimination tree for the right-looking sparse Cholesky algorithm, explaining how it predicts fill-in and task dependencies without performing dense factorization.
This article introduces the Cornell Virtual Workshop's free online tutorial on basic CUDA programming using C, covering prerequisites and additional resources.
A 2019 blog post from FLOW Lab at BYU explores how to optimize Julia code to match C++ performance using a real-world aerodynamics application (vortex particle method) as a benchmark. The author shares lessons learned about achieving high-performance computing in Julia through type declarations, JIT compilation, and code optimization techniques.
DeepSeek releases DeepGEMM, a high-performance CUDA kernel library for LLM computation primitives, including FP8/FP4/BF16 GEMMs, fused MoE with overlapped communication, and MQA scoring. Kernels are compiled at runtime via JIT, so no CUDA compilation is required at installation time. The library achieves up to 1550 TFLOPS on H800 and matches or exceeds expert-tuned libraries across various matrix shapes.