gpu-efficiency

#gpu-efficiency

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

arXiv cs.LG · 2026-05-25

ModeSwitch-LLM is a lightweight controller that routes LLM inference requests to appropriate fixed modes (e.g., FP16, quantization, speculative decoding) on a single GPU, achieving up to 2.10× latency speedup and 51.7% energy reduction without retraining the model.

#gpu-efficiency

@jun_song: If we ever figure out how to load ONLY the active params of an MoE into the GPU instead of the full weights, it's game …

X AI KOLs Following · 2026-05-10

The author speculates that loading only the active parameters of an MoE model onto the GPU, rather than the full weights, could drastically improve efficiency and make it possible to run large models such as Kimi locally, though the author acknowledges that this is currently impractical.

#gpu-efficiency

@AI_jacksaku: This week’s GitHub dark horse—Unsloth speeds up AI model training 2-5× while cutting VRAM use by 80%. What does that mean? Fine-tuning a large model used to require an A100 cluster and tens of thousands of dollars. Now one RTX 4090 can finish the job in a few hours. How? By optimizing attention compute, eliminating redundant memory copies, and adding QLoRA & Flash Attention support.

X AI KOLs Timeline · 2026-04-23

According to the post, the open-source tool Unsloth speeds up large-model fine-tuning by 2-5× and cuts VRAM use by 80%, letting a single RTX 4090 finish in a few hours a job that once needed an A100 cluster.
