Stratum: System-Hardware Co-Design with 3D-Stackable DRAM for Efficient MoE
Summary
Introduces Stratum, a system-hardware co-design that uses 3D-stackable DRAM to accelerate Mixture-of-Experts (MoE) models efficiently.
Similar Articles
Are the rich RAM/poor GPU people wrong here?
Discusses the trade-off between dense and Mixture-of-Experts (MoE) models for local AI, noting that high-RAM users have few MoE options beyond Qwen 3.5 122B, and asking whether a large GPU is the only viable path.
Multi Tier MoE Caching
Discusses multi-tier caching strategies for MoE models to improve inference speed by keeping frequently activated experts on GPU, referencing existing implementations like PowerInfer and llama.cpp branches.
A new way to build chips: Sequentially stacking silicon to extend Moore's Law
Researchers at the University of Illinois have demonstrated a scalable method to sequentially stack high-performance silicon circuits, achieving monolithic 3D integration within strict thermal budgets, which could extend Moore's Law beyond traditional transistor shrinking.
An Efficient Heterogeneous Co-Design for Fine-Tuning on a Single GPU
SlideFormer introduces a heterogeneous co-design for full-parameter LLM fine-tuning on a single GPU, leveraging GPU/CPU/RAM/NVMe with a layer-sliding engine and optimized Triton kernels, enabling fine-tuning of 123B+ models on a single RTX 4090 with significant throughput improvements.
DisagMoE: Computation-Communication overlapped MoE Training via Disaggregated AF-Pipe Parallelism
This paper introduces DisagMoE, a system for MoE training that optimizes computation-communication overlap by disaggregating attention and FFN layers across GPU groups. Implemented on Megatron-LM, it achieves up to 1.8x speedup on H800 clusters by addressing inter-node communication bottlenecks.