Tag
vLLM integrates Mooncake Store for distributed KV cache reuse, enabling cross-node prefix caching to efficiently serve agentic workloads with high token reuse.
A tweet recommending a paper that is described as the bible of distributed inference.
A blog post guides readers through setting up a Raspberry Pi cluster for distributed training and inference, part of a series aimed at making distributed AI accessible using affordable hardware.
antirez announces receiving an M5 Max 128GB MacBook Pro from audreyt to develop DwarfStar4 and experiment with distributed inference across M3 Max and M5 Max hardware.
Federation of Experts (FoE) restructures mixture-of-experts blocks into clusters that process KV heads independently, eliminating inter-node communication bottlenecks and improving inference throughput and latency by up to 5.2x while maintaining generation quality.
A user shares their $25k hardware setup of two 512GB RAM M3 Ultra Mac Studios for running large language models locally, having tested DeepSeek V3 Q8 and GLM 5.1 Q4 via the exo distributed inference backend, while awaiting Kimi 2.6 MLX optimization.