Tag
A pruned and quantized version of MiniMax-M3 (MiniMax-M3-Medium-JANG_2L) optimized to run on 128 GB Macs using vMLX; 32% expert pruning and JANG_2L mixed-precision quantization bring it to within ~105 GB.
This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.
This paper introduces a resource-efficient pruning framework that identifies and removes parameters associated with unsafe behaviors in large language models while preserving utility. Using gradient-free attribution and a Lottery Ticket Hypothesis perspective, the method achieves significant reductions in unsafe generations and improved robustness against jailbreak attacks with minimal performance loss.