Tag
This paper proposes a dense-to-sparse continual training method for LLMs, using a predictor-gated bank-wise sparsity to achieve 4x FFN sparsity, and demonstrates it on Qwen2.5-8B with long-context training.