SCAPE is a communication-efficient distributed optimizer that leverages first-moment statistics to enable extreme sparsification for LLM training, preserving accuracy while reducing wall-clock time by up to 43.3%.
Llama Surgery injects learned block-sparse attention topologies into a pre-trained Llama 3.1 8B model without retraining from scratch. Its Dynamic Topology Router combines Gumbel-Softmax routing, temperature annealing, and a Straight-Through Estimator to avoid gradient collapse, and the resulting model converges stably and produces coherent output.
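
To make the routing mechanism named above concrete, here is a minimal sketch of Gumbel-Softmax sampling with a straight-through backward pass and an annealed temperature. It is not the Llama Surgery implementation: the function names (`gumbel_softmax_st`, `st_backward`, `annealed_tau`), the exponential decay schedule, and all default values are assumptions chosen for illustration, and the code uses plain Python with a hand-written gradient rather than an autograd framework.

```python
import math
import random


def annealed_tau(step, tau_start=1.0, tau_end=0.1, decay=0.999):
    """Hypothetical schedule: exponential decay with a floor at tau_end."""
    return max(tau_end, tau_start * decay ** step)


def gumbel_softmax_st(logits, tau, rng):
    """Sample one routing decision.

    Returns (hard, soft): `hard` is the one-hot choice used in the forward
    pass, `soft` is the relaxed sample used only to compute gradients.
    """
    noisy = []
    for logit in logits:
        # Clamp the uniform draw so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        gumbel = -math.log(-math.log(u))  # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / tau)
    # Numerically stable softmax over the perturbed, temperature-scaled logits.
    peak = max(noisy)
    exps = [math.exp(z - peak) for z in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    choice = max(range(len(soft)), key=soft.__getitem__)
    hard = [1.0 if i == choice else 0.0 for i in range(len(soft))]
    return hard, soft


def st_backward(soft, grad_out, tau):
    """Straight-through backward pass.

    The forward pass used `hard`, but the gradient w.r.t. the logits is
    taken as if the output had been `soft` (the softmax Jacobian, scaled
    by 1 / tau).
    """
    dot = sum(g * s for g, s in zip(grad_out, soft))
    return [s * (g - dot) / tau for s, g in zip(soft, grad_out)]


# Example: route among 4 candidate block patterns at training step 500.
rng = random.Random(0)
tau = annealed_tau(500)
hard, soft = gumbel_softmax_st([0.2, 1.5, -0.3, 0.9], tau, rng)
grad_logits = st_backward(soft, grad_out=[0.1, -0.4, 0.0, 0.3], tau=tau)
```

The one-hot `hard` vector keeps the forward pass discrete, so only one topology is selected, while `st_backward` still passes a nonzero gradient to every logit. Lowering `tau` over training pushes `soft` toward `hard`, which narrows the gap between the forward value and the gradient surrogate.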