Tag
ModeSwitch-LLM is a lightweight controller that routes each LLM inference request to a suitable fixed inference mode (e.g., FP16, quantization, speculative decoding) on a single GPU, achieving up to a 2.10× latency speedup and a 51.7% energy reduction without retraining the model.
The author speculates that loading only the active parameters of MoE models onto the GPU could drastically improve efficiency and make it possible to run large models such as Kimi locally, though they acknowledge that this is currently impractical.
Unsloth, an open-source tool, speeds up large-model fine-tuning by 2-5× and cuts VRAM usage by 80%, letting a single RTX 4090 finish in hours what once required an A100 cluster.