This paper introduces Rotary GPU, an exploratory execution approach for running large Mixture-of-Experts models on consumer hardware with limited VRAM. It reaches 21 tokens/s on an RTX 4060 with 8 GB of VRAM. The focus is on deployment accessibility rather than architectural improvements.